<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: bytewax</title>
    <description>The latest articles on DEV Community by bytewax (@bytewax).</description>
    <link>https://dev.to/bytewax</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F6754%2F3bdc2c57-9051-412a-a445-6b558a9c64e8.png</url>
      <title>DEV Community: bytewax</title>
      <link>https://dev.to/bytewax</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bytewax"/>
    <language>en</language>
    <item>
      <title>M12 invests in the Future of Stream Processing with Bytewax</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Wed, 09 Aug 2023 16:25:50 +0000</pubDate>
      <link>https://dev.to/bytewax/m12-invests-in-the-future-of-stream-processing-with-bytewax-3n43</link>
      <guid>https://dev.to/bytewax/m12-invests-in-the-future-of-stream-processing-with-bytewax-3n43</guid>
      <description>&lt;p&gt;At Bytewax, we're passionate about the power of real-time data. With AI and automation on the rise, accessing data instantly isn't just a cool perk—it's becoming a necessity. Our mission is to build software that will strip away the complexities of streaming and make it accessible for &lt;strong&gt;every developer&lt;/strong&gt; to build real-time data applications.&lt;/p&gt;

&lt;p&gt;We started with the Rust-powered, open source Python stream processor, &lt;a href="https://github.com/bytewax/bytewax"&gt;Bytewax&lt;/a&gt;, which debuted in February 2022 and is now a year and a half old. Since starting the project, we have grown and matured the Bytewax open source offering to include persistent state, different windowing configurations, and new operators for increased performance and scalability. We have also focused on improving the developer experience from integration to deployment with our deployment tool, &lt;a href="https://dev.to/docs/deployment/waxctl"&gt;waxctl&lt;/a&gt;, the ability to rescale without losing data stored in state, and the ability to connect to various input and output sources as well as build your own.&lt;/p&gt;

&lt;p&gt;We are excited to welcome a new partner on our journey: M12/GitHub, whose investment in Bytewax will support further development of the open source project as well as the development of &lt;a href="https://dev.to/platform"&gt;&lt;strong&gt;the Bytewax Platform&lt;/strong&gt;&lt;/a&gt;, which will help businesses scale out their Bytewax usage, starting with features like disaster recovery, collaboration and observability tools, and a management layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Bytewax Supports AI and Real-Time Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The world has moved into a new wave of computing where businesses power their operations and consumer interactions with AI. Sophisticated AI models require a real-time understanding of the world to make accurate decisions. Real-time ML refers to a system that reacts in real time to the inputs it receives, with a decision powered by an ML model. Stream processing, and more specifically &lt;strong&gt;stream processing with a Python interface&lt;/strong&gt;, is pivotal for real-time ML because it transforms data into features for models.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can read more about real-time ML with Bytewax in &lt;a href="https://dev.to/blog/real-time-ml"&gt;our blog post here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are many other use cases currently being powered by Bytewax from monitoring and reacting to &lt;a href="https://dev.to/blog/online-machine-learning-in-practice-interactive-dashboards-to-detect-data-anomalies-in-real-time"&gt;IoT sensors&lt;/a&gt; for vehicle fleets or across the energy grid, to monitoring &lt;a href="https://dev.to/blog/real-time-stock-prices-with-numpy"&gt;market data&lt;/a&gt; or analyzing &lt;a href="https://dev.to/blog/aws-anomaly-detection"&gt;infrastructure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New Investment: A Vote of Confidence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Microsoft is known for its investments in &lt;a href="https://devblogs.microsoft.com/python/supporting-the-python-community/"&gt;Python&lt;/a&gt; and &lt;a href="https://blogs.microsoft.com/blog/2023/01/23/microsoftandopenaiextendpartnership/"&gt;AI&lt;/a&gt;, creating partnerships with pivotal developers and teams that are moving the industry forward. Their investment in Bytewax is a vote of confidence in the Bytewax vision and mission, and in the importance of stream processing in the next wave of computing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We believe that Zander and the Bytewax team are building a cutting edge tool that simplifies event and stream processing, and appreciate their thoughtful technical approach leveraging a Python framework to build highly scalable streaming dataflows” said Priyanka Mitra, Partner at M12 and co-founder of the M12 GitHub Fund. “We are impressed with their engagement of the open-source community and are committed to supporting Bytewax in accomplishing their mission, especially as they explore cutting edge AI and ML use cases” she added.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The Future of Bytewax&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Microsoft investment will help Bytewax establish a thriving community around the open source project and build out features for the paid platform to support adoption of the technology. We have been working to solve exceptionally hard problems like rescaling dataflows and cloud backup for disaster recovery as well as improving performance. We are excited to continue to bring features like these to Bytewax with a simple user interface and low complexity to support users across all stages of their journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect with us&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We would love to hear from our users and any Python and streaming enthusiasts on how we can increase our support for workloads and Python development patterns. Please feel free to reach out via our &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-vkos2f6r-_SeT9pF2~n9ArOaeI3ND2w"&gt;Slack community&lt;/a&gt; or the &lt;a href="https://github.com/bytewax/bytewax"&gt;GitHub repo&lt;/a&gt;. We would also like to take this opportunity to thank our users, investors, and community for their continued support! If you like what we are building, please &lt;a href="https://github.com/bytewax/bytewax"&gt;⭐ the repo&lt;/a&gt; 😀.&lt;/p&gt;

</description>
      <category>investment</category>
      <category>streaming</category>
    </item>
    <item>
      <title>Data Parallel, Task Parallel, and Agent Actor Architectures</title>
      <dc:creator>Zander</dc:creator>
      <pubDate>Thu, 13 Jul 2023 19:26:35 +0000</pubDate>
      <link>https://dev.to/bytewax/data-parallel-task-parallel-and-agent-actor-architectures-dm6</link>
      <guid>https://dev.to/bytewax/data-parallel-task-parallel-and-agent-actor-architectures-dm6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving world of data processing, understanding the various architectural approaches is pivotal to choosing the right tools for your specific needs. The three dominant architectures that have emerged—data parallel, task parallel, and agent actor—each offer unique strengths that cater to different types of data workloads.&lt;/p&gt;

&lt;p&gt;Data parallel architectures shine when large datasets need to be processed in parallel. This model divides data into smaller chunks, each processed independently but in the same manner on different workers or nodes. Apache Spark, a well-known data processing framework, uses this architecture. Spark's resilience, capacity for handling vast amounts of data, and ability to perform complex transformations make it a favorite in big data landscapes. Bytewax also follows this model with the same transformations happening on each worker, but on different data.&lt;/p&gt;

&lt;p&gt;On the other hand, task parallel architectures, as exemplified by Apache Flink and Dask, focus on executing different tasks concurrently across distributed systems. This approach is particularly effective for workflows with a wide variety of tasks that can be performed independently or have complex dependencies. Flink's stream-first philosophy provides robustness for real-time processing tasks, while Dask's flexibility makes it a great choice for parallel computing tasks in Python environments.&lt;/p&gt;

&lt;p&gt;Finally, the agent actor architecture, the foundation for Ray, presents a flexible and robust model for handling complex, stateful, and concurrent computations. In this model, "actors" encapsulate state and behavior, communicating through message passing. Ray's ability to scale from a single node to a large cluster makes it a popular choice for machine learning tasks.&lt;/p&gt;

&lt;p&gt;As we delve deeper into these architectures in the following sections, we will explore their pros and cons, use cases, and the unique features offered by Spark, Flink, Dask, Ray, and Bytewax. By understanding these architectures, you'll be better equipped to select the right framework for your next data processing venture. Stay tuned!&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Parallel Architectures
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ttZGQmfL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/Data_parallelism_d1e340a7c5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ttZGQmfL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/Data_parallelism_d1e340a7c5.png" alt="Data parallelism.png" width="776" height="886"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data parallelism is a form of parallelization that distributes the data across different nodes, which operate independently of each other. Each node applies the same operation on its allocated subset of data. This approach is particularly effective when dealing with large datasets where the task can be divided and executed simultaneously, reducing computational time significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In data parallel architectures, the dataset is split into smaller, more manageable chunks, or partitions. Each partition is processed independently by separate tasks running the same operation. This distribution is done in a way that each task operates on a different core or processor, enabling high-level parallel computation.&lt;/p&gt;
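
&lt;p&gt;As a toy illustration of this partition/apply/aggregate shape (plain Python, not any particular framework), the same operation can be mapped over independent chunks and the partial results combined afterwards:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split data into n roughly equal, contiguous chunks (one per worker)."""
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def process_chunk(chunk):
    # Every worker applies the *same* operation to its own slice of the data.
    return sum(x * x for x in chunk)

data = list(range(1_000))
chunks = partition(data, 4)

# Threads stand in here for the separate processes or nodes a real
# framework would use; the partition/apply/aggregate shape is the same.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))

# A final aggregation step combines the per-partition results.
total = sum(partials)
```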

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Data parallel architectures are designed to handle large volumes of data. As data grows, you can simply add more nodes to the system to maintain performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; The ability to perform computations in parallel leads to significant speedups, particularly for large datasets and computationally intensive operations. Because data does not move between workers as often, there can also be an additional performance gain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; Since the same operation is applied to each partition, this model is relatively simple to understand and implement.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Communication Overhead:&lt;/strong&gt; The nodes need to communicate with each other to synchronize and aggregate results, which can add overhead, particularly for large numbers of nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Use Cases:&lt;/strong&gt; Data parallelism works best when the same operation can be applied to all data partitions. It's less suitable for tasks that require complex interdependencies or shared state across tasks, although, as Spark demonstrates, this limitation can be partially worked around.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Best Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data parallel architectures excel in situations where large volumes of data need to be processed quickly and in a similar manner. Some of the best use cases include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batch Processing:&lt;/strong&gt; In scenarios where large amounts of data need to be processed all at once, data parallel architectures shine. This is a common use case in big data analytics, where massive datasets are processed in batch jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine Learning:&lt;/strong&gt; Many machine learning algorithms, especially those that involve matrix operations, can be easily parallelized. For instance, in the training phase of a neural network, the weights of the neurons are updated based on the error. This operation can be done in parallel for each layer, making data parallelism a great fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Highly Partitioned Input and Output:&lt;/strong&gt; Data parallel frameworks excel when the input and output are partitioned in such a way that the workers can evenly match the partitions and redistribution of the data is limited.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing:&lt;/strong&gt; The data parallelism approach is well suited to stream processing, where the same operation is applied to data in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Apache Spark, a notable data parallel framework, is widely used in big data analytics for tasks like ETL (Extract, Transform, Load), predictive analytics, and data mining. It's particularly known for its ability to perform complex data transformations and aggregations across large datasets.&lt;/p&gt;

&lt;p&gt;Bytewax is known for its ability to handle large continuous streams of data and do complex transformations on them in real time.&lt;/p&gt;

&lt;p&gt;As we continue our exploration into the different data processing architectures, we'll see how other approaches handle tasks that might not be as suitable for data parallel processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Task Parallel Architectures: Unlocking Concurrent Processing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vkVYkOoH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/Task_parallelism_4cc6d9f034.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vkVYkOoH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/Task_parallelism_4cc6d9f034.png" alt="Task parallelism.png" width="776" height="886"&gt;&lt;/a&gt;Task parallelism, also known as function parallelism, is an architectural approach that focuses on distributing tasks—rather than data—across different processing units. Each of these tasks can be a separate function or a method operating on different data or performing different computations. This type of parallelism is a great fit for problems where different operations can be performed concurrently on the same or different data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a task parallel model, the focus is on concurrent execution of many different tasks that are part of a larger computation. These tasks can be independent, or they can have defined dependencies and need to be executed in a certain order. The tasks are scheduled and dispatched to different processors in the system, enabling parallel execution.&lt;/p&gt;
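
&lt;p&gt;The pattern can be sketched with Python's standard library (an illustration only, not tied to Flink or Dask): two independent tasks are dispatched concurrently, while a third task depends on both of their results.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Three *different* tasks; the first two are independent,
# the third depends on both of their results.
def load_orders():
    return [("a", 3), ("b", 5)]

def load_prices():
    return {"a": 2.0, "b": 4.0}

def compute_revenue(orders, prices):
    return sum(qty * prices[sku] for sku, qty in orders)

with ThreadPoolExecutor() as pool:
    # Independent tasks run concurrently...
    orders_f = pool.submit(load_orders)
    prices_f = pool.submit(load_prices)
    # ...and the dependent task runs once its inputs resolve.
    revenue = compute_revenue(orders_f.result(), prices_f.result())
```

Frameworks like Dask generalize this idea into a full task graph, scheduling each task as soon as its dependencies complete.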

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Diverse Workloads:&lt;/strong&gt; Task parallel architectures excel in scenarios where the problem can be broken down into a variety of tasks that can be executed in parallel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Since tasks don't necessarily need to operate on the same data or perform the same operation, this model offers a high level of flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency:&lt;/strong&gt; Task parallelism can lead to improved resource utilization, as tasks can be scheduled to keep all processors busy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; Managing and scheduling tasks, especially when there are dependencies, can add complexity to the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inter-task Communication:&lt;/strong&gt; Tasks often need to communicate with each other to synchronize or to pass data, which can lead to overhead and can be a challenge for performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Best Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Task parallel architectures are best suited to problems that can be broken down into discrete tasks that can run concurrently. This includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complex Computations:&lt;/strong&gt; Scenarios where a complex problem can be broken down into a number of separate tasks, such as simulations or optimization problems, are a good fit for task parallel architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Processing On Diverse Datasets:&lt;/strong&gt; Task parallel architectures are often used in systems that require real-time processing and low latency, such as stream processing systems.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Apache Flink is an excellent example of a system that uses a task parallel architecture. Flink is designed for stream processing, where real-time results are of utmost importance. It breaks down stream processing into a number of tasks that can be executed in parallel, providing low-latency and high-throughput processing of data streams.&lt;/p&gt;

&lt;p&gt;Similarly, Dask is a flexible library for parallel computing in Python that uses task scheduling for complex computations. Dask allows you to parallelize and distribute computation by breaking it down into smaller tasks, making it a popular choice for tasks that go beyond the capabilities of typical data parallel tools.&lt;/p&gt;

&lt;p&gt;In the next section, we'll explore the agent actor model, a different approach to managing concurrency and state that opens up new possibilities for parallel computation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Actor Architectures: Pioneering Concurrent Computations
&lt;/h2&gt;

&lt;p&gt;Agent actor architectures introduce a fundamentally different approach to handling parallel computations, particularly for problems that involve complex, stateful computations. This approach builds on task parallelism with the addition of an actor: a computational entity that, in response to a message it receives, can concurrently make local decisions, create more actors, send more messages, and determine how to respond to the next message received. The agents are then similar to the tasks in task-parallel or functionally distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mechanism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the agent actor model, actors are the universal primitives of concurrent computation. Upon receiving a message, an actor can change its local state, send messages to other actors, or create new actors. Actors encapsulate their state, avoiding common pitfalls of multithreaded programming such as race conditions. Actor systems are inherently message-driven and can be distributed across many nodes, making them highly scalable.&lt;/p&gt;
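
&lt;p&gt;A minimal, single-process sketch of the idea (plain Python threads and queues, not Ray's API): the actor owns its state, processes its mailbox sequentially, and is only ever reached via messages, so no locks are needed.&lt;/p&gt;

```python
import queue
import threading

class CounterActor:
    """A minimal actor: isolated state, a mailbox, sequential message handling."""
    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # local state, never touched by other threads
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            msg, reply = self._mailbox.get()
            if msg == "inc":
                self._count += 1        # safe: only this thread mutates state
            elif msg == "get":
                reply.put(self._count)  # respond via a message, not shared memory
            elif msg == "stop":
                break

    def send(self, msg):
        self._mailbox.put((msg, None))

    def ask(self, msg):
        reply = queue.Queue()
        self._mailbox.put((msg, reply))
        return reply.get()

actor = CounterActor()
for _ in range(100):
    actor.send("inc")
count = actor.ask("get")  # messages serialize access, so no race conditions
actor.send("stop")
```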

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent State Management:&lt;/strong&gt; Actors provide a safe way to handle mutable state in a concurrent system. Since each actor processes messages sequentially and has isolated state, there is no need for locks or other synchronization mechanisms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Actor systems are inherently distributed and can easily scale out across many nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault Tolerance:&lt;/strong&gt; Actor systems can be designed to be resilient with self-healing capabilities. If an actor fails, it can be restarted, and messages it was processing can be redirected to other actors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; Building systems with the actor model can be more complex than traditional paradigms due to the asynchronous and distributed nature of actors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Overhead:&lt;/strong&gt; Communication between actors is done with messages, which can lead to overhead, especially in systems with a large number of actors.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Best Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agent actor architectures are best suited for problems that involve complex, stateful computations and require high levels of concurrency. This includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Systems:&lt;/strong&gt; The actor model is well suited for real-time systems where you need to process high volumes of data concurrently, such as trading systems or real-time analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Systems:&lt;/strong&gt; The actor model can be a good fit for building distributed systems where you need to manage state across multiple nodes, like IoT systems or multiplayer online games.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ray is an example of a system that employs the actor model. It was designed to scale Python applications from a single node to a large cluster, and it's commonly used for machine learning tasks, which often require complex, stateful computations.&lt;/p&gt;

&lt;p&gt;As we've seen, the landscape of data processing architectures is rich and diverse, with each model offering unique strengths and potential challenges. Whether it's data parallel, task parallel, or agent actor, the choice of architecture will depend largely on the nature of the data workload and the specific requirements of the system you're building.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>actors</category>
    </item>
    <item>
      <title>Reasoning about Streaming vs Batch with a Case Study from GitHub</title>
      <dc:creator>Zander</dc:creator>
      <pubDate>Thu, 15 Jun 2023 20:26:32 +0000</pubDate>
      <link>https://dev.to/bytewax/reasoning-about-streaming-vs-batch-with-a-case-study-from-github-5g3l</link>
      <guid>https://dev.to/bytewax/reasoning-about-streaming-vs-batch-with-a-case-study-from-github-5g3l</guid>
      <description>&lt;p&gt;&lt;em&gt;If you prefer videos check out Zander's talk at Data Council 2023 &lt;a href="//youtu.be/qJ3PWyx7w2Q"&gt;"When to Move from Batch to Streaming and how to do it without hiring an entirely new team"&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The world of data processing is undergoing a significant shift, moving towards real-time processing. Despite an increase in understanding that shifting workloads to real-time can increase ROI and lower costs, there isn't consensus in the industry around how to best transition workloads to real-time and what the best tools are for different types of real-time workloads. While traditional analytical tools such as data warehouses, business intelligence layers, and metrics are widely accepted and understood, the concept of real-time data processing and the technologies that enable it are not as widely recognized or agreed upon.&lt;/p&gt;

&lt;p&gt;In this post, we aim to demystify real-time data processing, discussing its relevance within an organization, the different types of real-time workloads, and some real-world examples from my time at GitHub. But first, let's clarify some definitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Real-Time and Stream Processing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MAmdy1fh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/real_time_53cfa27cf7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MAmdy1fh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/real_time_53cfa27cf7.jpg" alt="real-time.jpg" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is Real-Time? "Real-Time" refers to anything that is perceived to happen in real time by a human - an admittedly fuzzy definition. Quantitatively, this usually refers to processes that happen in the sub-second realm. Interestingly, based on this definition, real-time data processing can actually occur with both batch and stream processing technologies depending on the end-to-end latency.&lt;/p&gt;

&lt;p&gt;Stream processing refers to processing a single datum at a time, flowing in a continuous stream, while batch processing is when you gather a batch of data and process it all at once. By reducing the size of the batch progressively, we can edge closer to real-time processing. This is precisely what technologies like Spark's structured streaming do with micro-batches.&lt;/p&gt;
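
&lt;p&gt;To make the distinction concrete, here is a toy sketch (not tied to any particular engine) of the same running-total computation done as one batch, as micro-batches, and one datum at a time:&lt;/p&gt;

```python
events = [4, 1, 7, 2, 9, 3]

# Batch: gather everything, then process it all at once.
batch_total = sum(events)

# Micro-batch: shrink the batch size; results arrive sooner, chunk by chunk.
micro_totals = [sum(events[i:i + 2]) for i in range(0, len(events), 2)]

# Streaming: process one datum at a time as it arrives.
running = 0
stream_totals = []
for e in events:  # in a real system, this would be an unbounded source
    running += e
    stream_totals.append(running)
```

As the micro-batch size shrinks toward one, the micro-batch results converge on the streaming behavior, which is exactly the lever Spark structured streaming pulls.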

&lt;p&gt;Now that we have some definitions out of the way, let's dive into real-time processing and the different types of real-time workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Relevance of Real-Time Data Processing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DOxOtcp---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/tenor_45419acb5c.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DOxOtcp---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/tenor_45419acb5c.gif" alt="tenor.gif" width="576" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our day-to-day lives, we're constantly receiving and processing information in real time. Consider driving a car - an activity that requires processing multiple inputs and making decisions in real time. If we were to approach driving in a batch processing manner, waiting to gather information for a duration and then trying to forecast the next 15 seconds, it would likely end in disaster. As another example, consider a sport like basketball. At each moment, the players receive tens or hundreds of inputs and react to them in real time. A non-real-time version of the game, in which each player collected inputs for a few seconds and only then reacted, would not be nearly as exciting to watch or play.&lt;/p&gt;

&lt;p&gt;These examples help to highlight why we might choose to process things in real-time. In the context of driving, we're making decisions that could potentially be a matter of life or death. And in our basketball example, the real-time processing elevates the user experience. However, while these examples provide some understanding, they don't necessarily help us generalize the concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Real-Time Workloads
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KablIMzo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/analytical_vs_operational_49781ffe70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KablIMzo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/analytical_vs_operational_49781ffe70.png" alt="analytical vs operational" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can broadly categorize real-time processing into two types of workloads: analytical and operational.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analytical workloads
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UTlf0zho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/linked_in_views_71137c4795.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UTlf0zho--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/linked_in_views_71137c4795.png" alt="linked-in-views.png" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analytical workloads require low latency, freshness, and the capability of retrieval at scale. Real-time analytical workloads must be queryable. A good example of this is LinkedIn's profile view notification. When you click the profile view notification, you're taken to a page that shows your profile view history, all the way up to the most recent data. This demonstrates both freshness and queryability: you can filter and interact with the very latest data.&lt;/p&gt;

&lt;p&gt;Another example of a real-time analytical workload is an Instacart order. When you place an order on Instacart, you can go into your order and see the updated Estimated Time of Arrival (ETA). This is another instance of an analytical real-time workload where the user is interacting with analytical data in real-time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational workloads
&lt;/h3&gt;

&lt;p&gt;Operational workloads, on the other hand, require low latency and freshness, but they also need to be reactive. This means that some of the decision-making or business logic is embedded inline in the system; in a streaming use case, this logic lives inside the stream processor. The data is received, transformed, and then a decision is made in an online fashion. Bytewax is a great example of a framework that can be used to make real-time decisions for operational workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eITPBuJE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/fraud_4cc560b3e9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eITPBuJE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/fraud_4cc560b3e9.png" alt="fraud.png" width="236" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A good example of operational processing is fraud detection. The fraud detection system takes all the inputs in real time and makes a decision about them without a human in the loop. It then makes a decision on what to do and communicates with the user to confirm if its suspicion of fraud is correct.&lt;/p&gt;

&lt;p&gt;Another example in financial markets is high-frequency trading. The software system consumes inputs from a variety of different data sources, processes them in real time, and then makes a decision whether to buy or sell. The speed of making that decision is a key factor in this context.&lt;/p&gt;
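
&lt;p&gt;The inline decision-making that characterizes operational workloads can be sketched in a few lines of plain Python. This is a hypothetical rolling-average rule for illustration only: &lt;code&gt;WINDOW&lt;/code&gt;, &lt;code&gt;THRESHOLD&lt;/code&gt;, and &lt;code&gt;detect&lt;/code&gt; are made-up names, not a real fraud model or any Bytewax API.&lt;/p&gt;

```python
from collections import deque

WINDOW, THRESHOLD = 3, 3.0  # assumed parameters, for illustration only

def detect(amounts):
    """Flag each transaction as it arrives, with no human in the loop."""
    recent = deque(maxlen=WINDOW)
    decisions = []
    for amount in amounts:
        avg = sum(recent) / len(recent) if recent else amount
        # The decision logic lives *inline* in the stream: each event is
        # compared against the rolling average and acted on immediately.
        decisions.append("flag" if amount > THRESHOLD * avg else "ok")
        recent.append(amount)
    return decisions
```

In a real deployment, the same rule would be expressed as stateful operators in a stream processor such as Bytewax rather than a hand-rolled loop.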

&lt;h3&gt;
  
  
  Analytical vs Operational
&lt;/h3&gt;

&lt;p&gt;One more aspect I wanted to touch on here is the difference between having a human in the loop versus having a machine in the loop. If we look at different examples across analytical and operational workloads, there's a concept of the human being more involved in the analytical and less or not involved in the operational.&lt;/p&gt;

&lt;p&gt;To summarize, if there is a situation where you believe there's value to be derived and there's a human in the loop, there's probably a subset of tools within the real-time space that fall under the analytical workload. If you're building something like an algorithmic trading system, where you believe that there's no requirement for a human in the loop, you're more likely to fall under the operational category, and you should look at tools, like Bytewax, that support operational processing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Case Studies: GitHub's Real-Time Data Processing Decisions
&lt;/h1&gt;

&lt;p&gt;Let's make things more concrete by discussing a couple of case studies involving decisions we made at GitHub concerning real-time data processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trending Repositories and Developers: A Batch Processing Approach
&lt;/h2&gt;

&lt;p&gt;The team I was a part of at GitHub was responsible for several data products that were featured on github.com, including Trending Repositories and Trending Developers. These features were located on the GitHub Explore page and aimed to identify trending repositories and developers based on a variety of metrics, such as stars, forks, and views. We had access to this data in real time through a streaming platform (Kafka) managed by another team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--003vyP93--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/trending_1bd345b99f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--003vyP93--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/trending_1bd345b99f.png" alt="trending.png" width="512" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although we had the capacity to implement these as real-time features, we decided against it. Our team primarily consisted of data scientists and machine learning engineers who hadn't worked with streaming platforms or stream processors before. Moreover, these features were new products, and we didn't know how impactful they would be or whether users would find them valuable and engage with them repeatedly.&lt;/p&gt;

&lt;p&gt;Instead of implementing these features on real-time data, we decided to process the data in batch. We ran nightly queries against Presto, where the data landed from Kafka, and stored the processed results in a MySQL database for retrieval from github.com. These features were not real-time workloads, but they could have been. Had we determined they would be valuable as real-time data products, they would have served as excellent examples of analytical use cases.&lt;/p&gt;
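&lt;p&gt;The shape of that nightly job is easy to picture. The sketch below is illustrative only: the metric weights and field names are invented, and the real pipeline ran as queries against Presto with results written to MySQL, not as application code.&lt;/p&gt;

```typescript
// Toy nightly batch ranking: aggregate a day of per-repo metrics and
// rank them. Weights and field names are invented for illustration.
type RepoDay = { repo: string; stars: number; forks: number; views: number }

function trending(rows: RepoDay[], topN: number): string[] {
  const scored = rows.map((r) => ({
    repo: r.repo,
    // Invented weighting: stars count most, views least.
    score: r.stars * 3 + r.forks * 2 + r.views * 0.1,
  }))
  // Highest score first.
  scored.sort((a, b) => b.score - a.score)
  return scored.slice(0, topN).map((s) => s.repo)
}
```

&lt;p&gt;Running this once a night was enough, which is exactly why the batch approach was the right call for a brand-new product.&lt;/p&gt;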

&lt;h2&gt;
  
  
  Star Spam Detection: A Real-Time Processing Solution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rgcu5vPH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/star_history_c2a1668eae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rgcu5vPH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/star_history_c2a1668eae.png" alt="star-history.png" width="512" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another task we undertook was star spam detection. The concept of "stars" on GitHub repositories is used as a proxy to gauge the health and utility of the project. If we were unable to detect star spammers, it would degrade the platform's value for users, potentially leading to a downward spiral in the platform's overall value.&lt;/p&gt;

&lt;p&gt;We decided to tackle this problem in a real-time manner to limit users' exposure to star spam and the potential degradation of the platform. The data was available in Kafka and could therefore be consumed as soon as it arrived. Based on certain criteria, users could be flagged as spammers and action taken. Once a user was flagged, they were submitted for human review to decide on the next steps. This is an excellent example of operational processing.&lt;/p&gt;
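&lt;p&gt;In spirit, the flagging step looked something like the sketch below. This is not GitHub's actual criteria, nor the Bytewax API; the window size and limit are invented. As noted above, anyone flagged went to human review rather than being actioned automatically.&lt;/p&gt;

```typescript
// Toy real-time spam check: count stars per user inside a sliding time
// window and flag users whose rate is implausibly high.
type StarEvent = { user: string; at: number }

const WINDOW_MS = 60_000  // invented: 1-minute window
const LIMIT = 30          // invented: stars per window before flagging

const recent = new Map()  // user -> timestamps of their recent stars

// Returns the user name if they should be flagged for review, else null.
function onStar(event: StarEvent): string | null {
  const times: number[] = recent.get(event.user) ?? []
  // Keep only timestamps still inside the window.
  const kept = times.filter((t) => WINDOW_MS >= event.at - t)
  kept.push(event.at)
  recent.set(event.user, kept)
  return kept.length >= LIMIT ? event.user : null
}
```

&lt;p&gt;The key property is the in-memory state keyed by user, which is exactly the kind of state a stream processor persists and recovers for you.&lt;/p&gt;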

&lt;h1&gt;
  
  
  The Impact of Real-Time Processing Decisions
&lt;/h1&gt;

&lt;p&gt;The point is that the decision to implement real-time processing can have a significant impact on a project's return on investment, and this relationship should be carefully considered. If we had decided to make the trending feature real-time, it would have been even more necessary to maintain the platform's value by detecting star spam as close to real time as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8KNcxPFc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/Copy_of_Zander_Data_Council_2023_1_c28bfe1372.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8KNcxPFc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/Copy_of_Zander_Data_Council_2023_1_c28bfe1372.png" alt="roi and latency.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The value of data often degrades over time, and while it's usually depicted as a sharp decline (see the graph on the left), most projects or data tend to follow more of an S-curve (on the right). After a certain point, the return or value of the data caps out with respect to latency. In these case studies, neither project saw an exponential increase in return on investment as latency was reduced, and we were able to tackle star spammers on a timescale of hours rather than milliseconds. This demonstrates that not all data projects need to move towards zero latency to provide significant value.&lt;/p&gt;

&lt;p&gt;If you are interested in moving some of your workloads to real time but aren't sure where to start, please reach out to us in &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;our Slack channel&lt;/a&gt; and we would be happy to help you figure out whether the value is there and where to begin.&lt;/p&gt;

</description>
      <category>python</category>
      <category>streaming</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Bytewax v0.16.2 is out!</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Thu, 08 Jun 2023 20:35:31 +0000</pubDate>
      <link>https://dev.to/bytewax/bytewax-v0162-is-out-1dbb</link>
      <guid>https://dev.to/bytewax/bytewax-v0162-is-out-1dbb</guid>
      <description>&lt;p&gt;🎉 Exciting News from Bytewax! 🎉&lt;/p&gt;

&lt;p&gt;We're thrilled to announce the release of Bytewax v0.16.2!&lt;/p&gt;

&lt;p&gt;Firstly, support for Windows builds is here! 🖥️&lt;/p&gt;

&lt;p&gt;This is a significant step forward, not only because it makes Bytewax more accessible to developers across different platforms, but also because we're particularly excited to welcome the first contribution from a member of our community, Jim Zhang (&lt;a href="https://github.com/bytewax/bytewax/pull/249"&gt;@zzl221000&lt;/a&gt;)!&lt;/p&gt;

&lt;p&gt;A big shout-out to Jim!!!&lt;/p&gt;

&lt;p&gt;In addition to Windows support, v0.16.2 also introduces a CSVInput subclass of FileInput, further expanding the versatility of Bytewax.&lt;/p&gt;

&lt;p&gt;Here's a quick rundown of what's changed in this release:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/244"&gt;PyO3 has been updated by @whoahbot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/245"&gt;Added a _CSVSource and CSVInput subclass of FileInput by @awmatheson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/247"&gt;Fixed an encoder issue by @Psykopear&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/249"&gt;Windows build support by @zzl221000&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're OSS and incredibly grateful for the community's contributions. &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;Share&lt;/a&gt; what you're building with Bytewax, and happy coding! 🚀 Check out the changes on &lt;a href="https://github.com/bytewax/bytewax/"&gt;our GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Bytewax at Data Science Summit. Interactive Dashboards To Detect Data Anomalies In Real Time</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Wed, 24 May 2023 21:41:03 +0000</pubDate>
      <link>https://dev.to/bytewax/bytewax-at-data-science-summit-interactive-dashboards-to-detect-data-anomalies-in-real-time-5e3c</link>
      <guid>https://dev.to/bytewax/bytewax-at-data-science-summit-interactive-dashboards-to-detect-data-anomalies-in-real-time-5e3c</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NOnrBjg1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdl0zu6hnx1png37c1gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NOnrBjg1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdl0zu6hnx1png37c1gt.png" alt="talk invite" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Science Summit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dssconf.pl/en/"&gt;Data Science Summit&lt;/a&gt; is the largest and oldest independent data science conference in the CEE region. This year, we are joining them online and our CEO, Zander Matheson, is presenting! For the sixth time Data Science Summit shares knowledge in topics ranging from analysis and processing (including big data), implementation issues to visualisation (BI) and management topics. This year's edition of the most important Data Science event in Poland dedicated to Machine Learning!&lt;/p&gt;

&lt;p&gt;10 tracks, 100+ talks, the agenda is packed with cutting-edge insights 💡&lt;/p&gt;

&lt;p&gt;🎟️ Use code DSSML23RP20 until 09.06.2023 to grab a Standard or PRO ticket at a 20% discount&lt;/p&gt;

&lt;p&gt;Here are details of the talk Zander is presenting:&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive dashboards to detect data anomalies in real time
&lt;/h2&gt;

&lt;p&gt;Join Zander for a technical exploration of crafting interactive dashboards that employ online machine learning algorithms for real-time anomaly detection across hundreds of sensors. He will guide you through how to set up a development environment with a streaming system (Kafka or similar), load sensor data to the streaming system with Bytewax, and write a dataflow using River that will transform the data and use different anomaly detection algorithms to determine if there are anomalies in the sensor data. The icing on the cake? Visualize all these complex processes on a dynamic, real-time dashboard using Rerun! Equip yourself with the tools and knowledge to monitor and react to data anomalies as they happen. Come, experience the power of Python in data anomaly detection and interactive visualization in real time!&lt;/p&gt;

&lt;p&gt;If this abstract sounds interesting, you might want to check out these blogs: &lt;a href="https://bytewax.io/blog/data-visualization-with-rerun"&gt;Real-Time Anomaly Detection Visualization with Bytewax and Rerun&lt;/a&gt; and &lt;a href="https://bytewax.io/blog/online-machine-learning-iot#online-machine-learning-in-python"&gt;Online Machine Learning for IoT&lt;/a&gt;. The talk will go beyond these posts, but it covers the same domains.&lt;/p&gt;

&lt;p&gt;We are looking forward to exchanging knowledge, sharing our ideas, and learning from the experiences of other attendees and speakers. Stay tuned for updates from &lt;a href="https://ml.dssconf.pl/en/"&gt;the conference!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>conference</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Easy yet flexible way to display child routes in tabs with Vue 3</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Tue, 09 May 2023 18:22:17 +0000</pubDate>
      <link>https://dev.to/bytewax/easy-yet-flexible-way-to-display-child-routes-in-tabs-with-vue-3-2ng8</link>
      <guid>https://dev.to/bytewax/easy-yet-flexible-way-to-display-child-routes-in-tabs-with-vue-3-2ng8</guid>
      <description>&lt;p&gt;Hello, I'm Konrad Sieńkowski and I am a front-end developer &amp;amp; UI designer here at Bytewax. I want to share with you something that I worked on recently. In this article, I'll walk through the steps to set up a new Vue application, configure the router for nested routes, create the AppTabs.vue component, and customize your tabs using route meta fields for labels and icons. By the end, you'll know how to make an easy yet flexible solution for displaying child routes in tabs. So, let's dive in!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For those eager to dive in, check out the &lt;a href="https://github.com/konradsienkowski/vue-3-child-route-tabs/"&gt;project repository&lt;/a&gt; on Github.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;First of all, we're going to create a fresh new application using &lt;code&gt;&amp;amp;gt; npm init vue@latest&lt;/code&gt;. The &lt;code&gt;create-vue&lt;/code&gt; tool is going to ask you about including optional features in the project. The only one required for this tutorial is &lt;strong&gt;Vue Router&lt;/strong&gt;. I chose TypeScript &amp;amp; Prettier as well, but that's up to your personal preference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing routes &amp;amp; structure
&lt;/h2&gt;

&lt;p&gt;Once you follow the instructions on installing dependencies and running the app, you can start customizing the application. My first step was to simplify &lt;code&gt;app.vue&lt;/code&gt; a bit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;template&amp;amp;gt;
  &amp;amp;lt;nav&amp;amp;gt;
    &amp;amp;lt;RouterLink to=&amp;amp;quot;/&amp;amp;quot;&amp;amp;gt;Home&amp;amp;lt;/RouterLink&amp;amp;gt;
    &amp;amp;lt;RouterLink to=&amp;amp;quot;/tabs&amp;amp;quot;&amp;amp;gt;Tabs demo&amp;amp;lt;/RouterLink&amp;amp;gt;
  &amp;amp;lt;/nav&amp;amp;gt;

  &amp;amp;lt;RouterView /&amp;amp;gt;
&amp;amp;lt;/template&amp;amp;gt;

&amp;amp;lt;script setup lang=&amp;amp;quot;ts&amp;amp;quot;&amp;amp;gt;
import { RouterLink, RouterView } from &amp;amp;apos;vue-router&amp;amp;apos;
&amp;amp;lt;/script&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we're focusing on nested/child routes in this article, there's no need to spend much time on the homepage. I've also renamed the default &lt;code&gt;AboutView.vue&lt;/code&gt; to &lt;code&gt;TabsView.vue&lt;/code&gt; and created a bunch of example views in &lt;code&gt;views/tabs&lt;/code&gt;, called &lt;code&gt;TabsAbout.vue&lt;/code&gt;, &lt;code&gt;TabsBlog.vue&lt;/code&gt;, &lt;code&gt;TabsContact.vue&lt;/code&gt; and &lt;code&gt;TabsRelated.vue&lt;/code&gt;. We're going to include them in our routes structure in the next step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- views
-- tabs
--- TabsAbout.vue
--- TabsBlog.vue
--- TabsContact.vue
--- TabsRelated.vue
-- HomeView.vue
-- TabsView.vue

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have a simple structure for our views/pages, it's time to include them in the router configuration. Let's open &lt;code&gt;router/index.ts&lt;/code&gt; and adjust it to our needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createRouter, createWebHistory } from &amp;amp;apos;vue-router&amp;amp;apos;
import HomeView from &amp;amp;apos;../views/HomeView.vue&amp;amp;apos;

const router = createRouter({
  history: createWebHistory(import.meta.env.BASE_URL),
  routes: [
    {
      path: &amp;amp;apos;/&amp;amp;apos;,
      name: &amp;amp;apos;home&amp;amp;apos;,
      component: HomeView
    },
    {
      path: &amp;amp;apos;/tabs&amp;amp;apos;,
      name: &amp;amp;apos;tabs&amp;amp;apos;,
      component: () =&amp;amp;gt; import(&amp;amp;apos;../views/TabsView.vue&amp;amp;apos;),
      children: [
        {
          name: &amp;amp;apos;about&amp;amp;apos;,
          path: &amp;amp;apos;&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsAbout.vue&amp;amp;apos;),
        },
        {
          name: &amp;amp;apos;blog&amp;amp;apos;,
          path: &amp;amp;apos;blog&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsBlog.vue&amp;amp;apos;),
        },
        {
          name: &amp;amp;apos;contact&amp;amp;apos;,
          path: &amp;amp;apos;contact&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsContact.vue&amp;amp;apos;),
        },
        {
          name: &amp;amp;apos;related&amp;amp;apos;,
          path: &amp;amp;apos;related&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsRelated.vue&amp;amp;apos;),
        },
      ]
    }
  ]
})

export default router

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now our application has nested/child routes, which we can use to display tabs in the component.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tabs component
&lt;/h2&gt;

&lt;p&gt;In this step, we're going to create our tab component, include it in the first-level route view, and then extend it with additional features. First of all, we're going to create a file called &lt;code&gt;AppTabs.vue&lt;/code&gt; in the &lt;code&gt;components&lt;/code&gt; directory. Since our component is going to be flexible and might be used in different routes, we're following the &lt;a href="https://v2.vuejs.org/v2/style-guide/?redirect=true#Base-component-names-strongly-recommended"&gt;Vue naming convention&lt;/a&gt; for base components.&lt;/p&gt;

&lt;p&gt;Let's start with the &lt;code&gt;&amp;amp;lt;script setup&amp;amp;gt;&lt;/code&gt; section. We're using the &lt;code&gt;useRouter()&lt;/code&gt; composable there to access the router instance. Then, we're using it to define the &lt;code&gt;tabs&lt;/code&gt; computed property.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;script setup lang=&amp;amp;quot;ts&amp;amp;quot;&amp;amp;gt;
import { computed, type ComputedRef } from &amp;amp;apos;vue&amp;amp;apos;
import { useRouter, RouterView, type RouteRecordRaw } from &amp;amp;apos;vue-router&amp;amp;apos;

// Use children routes for the tabs
const router = useRouter()
const tabs: ComputedRef&amp;amp;lt;RouteRecordRaw[] | undefined&amp;amp;gt; = computed(() =&amp;amp;gt; {
  const currentRoute = router.currentRoute.value.name
  return router.options.routes?.find(
    (route) =&amp;amp;gt;
      route.name === currentRoute || route.children?.find((child) =&amp;amp;gt; child.name === currentRoute)
  )?.children
})
&amp;amp;lt;/script&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After getting the current route name from the &lt;code&gt;router.currentRoute&lt;/code&gt; property, we use it to find the matching route within the routes array (searching both top-level routes and their children) and return its child routes. Now it's time to include it in the component template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;template&amp;amp;gt;
  &amp;amp;lt;div class=&amp;amp;quot;tabs&amp;amp;quot; v-if=&amp;amp;quot;tabs&amp;amp;quot;&amp;amp;gt;
    &amp;amp;lt;nav class=&amp;amp;quot;tabs__nav&amp;amp;quot;&amp;amp;gt;
      &amp;amp;lt;RouterLink
        v-for=&amp;amp;quot;tab in tabs&amp;amp;quot;
        :key=&amp;amp;quot;tab.name&amp;amp;quot;
        class=&amp;amp;quot;tabs__nav-item&amp;amp;quot;
        :to=&amp;amp;quot;{ name: tab.name }&amp;amp;quot;
      &amp;amp;gt;
        {{ tab.name }}
      &amp;amp;lt;/RouterLink&amp;amp;gt;
    &amp;amp;lt;/nav&amp;amp;gt;
    &amp;amp;lt;div class=&amp;amp;quot;tabs__wrapper&amp;amp;quot;&amp;amp;gt;
      &amp;amp;lt;RouterView v-slot=&amp;amp;quot;{ Component }&amp;amp;quot;&amp;amp;gt;
        &amp;amp;lt;Transition name=&amp;amp;quot;fade&amp;amp;quot; mode=&amp;amp;quot;out-in&amp;amp;quot;&amp;amp;gt;
          &amp;amp;lt;component :is=&amp;amp;quot;Component&amp;amp;quot;&amp;amp;gt;&amp;amp;lt;/component&amp;amp;gt;
        &amp;amp;lt;/Transition&amp;amp;gt;
      &amp;amp;lt;/RouterView&amp;amp;gt;
    &amp;amp;lt;/div&amp;amp;gt;
  &amp;amp;lt;/div&amp;amp;gt;
&amp;amp;lt;/template&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the &lt;code&gt;&amp;amp;lt;div&amp;amp;gt;&lt;/code&gt; wrapper, we have two parts of our component:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the tabs navigation, where we iterate over the output of the &lt;code&gt;tabs&lt;/code&gt; computed getter and display links to the child routes,&lt;/li&gt;
&lt;li&gt;the tabs wrapper, where we use the native &lt;code&gt;&amp;amp;lt;RouterView&amp;amp;gt;&lt;/code&gt; and its v-slot API to &lt;a href="https://router.vuejs.org/guide/advanced/transitions.html#transitions"&gt;wrap the nested route's content in a &lt;code&gt;&amp;amp;lt;Transition&amp;amp;gt;&lt;/code&gt; component&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we can include our component in the &lt;code&gt;TabsView.vue&lt;/code&gt; code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;template&amp;amp;gt;
  &amp;amp;lt;div class=&amp;amp;quot;view&amp;amp;quot;&amp;amp;gt;
    &amp;amp;lt;AppTabs /&amp;amp;gt;
  &amp;amp;lt;/div&amp;amp;gt;
&amp;amp;lt;/template&amp;amp;gt;

&amp;amp;lt;script setup lang=&amp;amp;quot;ts&amp;amp;quot;&amp;amp;gt;
import AppTabs from &amp;amp;apos;@/components/AppTabs.vue&amp;amp;apos;
&amp;amp;lt;/script&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And take a look at the result: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cSpP8Rkh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2f6e016637.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cSpP8Rkh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2f6e016637.gif" alt="Vite-App.gif" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending &amp;amp; styling up the tabs
&lt;/h2&gt;

&lt;p&gt;Our tabs work nicely, and we can easily include them in any view that has child routes. However, the tabs navigation uses &lt;code&gt;route.name&lt;/code&gt; as a link label, and &lt;a href="https://router.vuejs.org/guide/essentials/named-routes.html"&gt;route names&lt;/a&gt; should rather remain simple and easy to use. We can extend our solution with route meta fields to include a custom tab label &amp;amp; icon for each child route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use custom route meta fields
&lt;/h3&gt;

&lt;p&gt;Before extending our component's code, let's add a &lt;a href="https://router.vuejs.org/guide/advanced/meta.html"&gt;meta field&lt;/a&gt; to each nested route in &lt;code&gt;router/index.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;children: [
  {
    name: &amp;amp;apos;about&amp;amp;apos;,
    path: &amp;amp;apos;&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsAbout.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;About&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;blog&amp;amp;apos;,
    path: &amp;amp;apos;blog&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsBlog.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Blog&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;contact&amp;amp;apos;,
    path: &amp;amp;apos;contact&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsContact.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Contact&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;related&amp;amp;apos;,
    path: &amp;amp;apos;related&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsRelated.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Related&amp;amp;apos; }
  },
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can use &lt;code&gt;tabLabel&lt;/code&gt; value in our &lt;code&gt;AppTabs.vue&lt;/code&gt; component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;RouterLink
  v-for=&amp;amp;quot;tab in tabs&amp;amp;quot;
  :key=&amp;amp;quot;tab.name&amp;amp;quot;
  class=&amp;amp;quot;tabs__nav-item&amp;amp;quot;
  :to=&amp;amp;quot;{ name: tab.name }&amp;amp;quot;
&amp;amp;gt;
  &amp;amp;lt;span class=&amp;amp;quot;tabs__nav-label&amp;amp;quot; v-if=&amp;amp;quot;tab.meta?.tabLabel&amp;amp;quot;&amp;amp;gt;{{ tab.meta.tabLabel }}&amp;amp;lt;/span&amp;amp;gt;
&amp;amp;lt;/RouterLink&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
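&lt;p&gt;If you're using TypeScript, it's also worth typing these meta fields so that &lt;code&gt;tab.meta?.tabLabel&lt;/code&gt; is checked by the compiler. vue-router supports this via module augmentation of its &lt;code&gt;RouteMeta&lt;/code&gt; interface. A minimal version (the field names match the ones used in this tutorial) could live in a &lt;code&gt;.d.ts&lt;/code&gt; file or in &lt;code&gt;router/index.ts&lt;/code&gt;:&lt;/p&gt;

```typescript
// Augment vue-router's RouteMeta so our meta fields are type-checked.
// Both fields are optional, since not every route defines them.
declare module 'vue-router' {
  interface RouteMeta {
    tabLabel?: string
    tabIcon?: string
  }
}
```

&lt;p&gt;With this in place, &lt;code&gt;tab.meta?.tabLabel&lt;/code&gt; is typed as &lt;code&gt;string | undefined&lt;/code&gt; instead of &lt;code&gt;unknown&lt;/code&gt;.&lt;/p&gt;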



&lt;h3&gt;
  
  
  Add material icons to tabs navigation
&lt;/h3&gt;

&lt;p&gt;Our tabs navigation is going to look better with icons. Let's install Google's Material Symbols library from npm with &lt;code&gt;npm install material-symbols@latest&lt;/code&gt; and include it in &lt;code&gt;main.ts&lt;/code&gt; (&lt;code&gt;main.js&lt;/code&gt; if you're not using TypeScript):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createApp } from &amp;amp;apos;vue&amp;amp;apos;
import App from &amp;amp;apos;./App.vue&amp;amp;apos;
import router from &amp;amp;apos;./router&amp;amp;apos;

import &amp;amp;apos;material-symbols/outlined.css&amp;amp;apos;;
import &amp;amp;apos;./assets/main.css&amp;amp;apos;

const app = createApp(App)

app.use(router)

app.mount(&amp;amp;apos;#app&amp;amp;apos;)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can add &lt;code&gt;tabIcon&lt;/code&gt; properties to the route meta fields, filling them with icon names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;children: [
  {
    name: &amp;amp;apos;about&amp;amp;apos;,
    path: &amp;amp;apos;&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsAbout.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;About&amp;amp;apos;, tabIcon: &amp;amp;apos;group&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;blog&amp;amp;apos;,
    path: &amp;amp;apos;blog&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsBlog.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Blog&amp;amp;apos;, tabIcon: &amp;amp;apos;feed&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;contact&amp;amp;apos;,
    path: &amp;amp;apos;contact&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsContact.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Contact&amp;amp;apos;, tabIcon: &amp;amp;apos;email&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;related&amp;amp;apos;,
    path: &amp;amp;apos;related&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsRelated.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Related&amp;amp;apos;, tabIcon: &amp;amp;apos;star&amp;amp;apos; }
  },
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we're ready to include them in the component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;RouterLink
  v-for=&amp;amp;quot;tab in tabs&amp;amp;quot;
  :key=&amp;amp;quot;tab.name&amp;amp;quot;
  class=&amp;amp;quot;tabs__nav-item&amp;amp;quot;
  :to=&amp;amp;quot;{ name: tab.name }&amp;amp;quot;
&amp;amp;gt;
  &amp;amp;lt;span class=&amp;amp;quot;tabs__nav-icon material-symbols-outlined&amp;amp;quot; v-if=&amp;amp;quot;tab.meta?.tabIcon&amp;amp;quot;&amp;amp;gt;{{
    tab.meta.tabIcon
  }}&amp;amp;lt;/span&amp;amp;gt;
  &amp;amp;lt;span class=&amp;amp;quot;tabs__nav-label&amp;amp;quot; v-if=&amp;amp;quot;tab.meta?.tabLabel&amp;amp;quot;&amp;amp;gt;{{ tab.meta.tabLabel }}&amp;amp;lt;/span&amp;amp;gt;
&amp;amp;lt;/RouterLink&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! We have custom icons &amp;amp; labels, based on route meta fields, displayed in our Tabs component. Now it's time to add the final styling touches with CSS. &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1TDttIo9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/icons_e127e03029.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1TDttIo9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/icons_e127e03029.png" alt="icons.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Styling up the component
&lt;/h2&gt;

&lt;p&gt;You can style the component on your own, customizing it fully to your needs, or use the code below by including it in &lt;code&gt;AppTabs.vue&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;style&amp;amp;gt;
.tabs {
  border: 1px solid rgba(0, 0, 0, 0.2);
  border-radius: 0.5rem;
}
.tabs__wrapper {
  padding: 1.5rem 2rem 2rem 2rem;
}
.tabs__nav {
  display: flex;
  flex-direction: row;
  border-bottom: 1px solid rgba(0, 0, 0, 0.2);
}
.tabs__nav-item {
  display: flex;
  flex-direction: row;
  align-items: center;
  flex-wrap: nowrap;
  text-decoration: none;
  padding: 1rem;
  border-bottom: 3px solid transparent;
  margin-bottom: -1px;
  color: rgba(0, 0, 0, 0.87);
  transition: border-color 0.25s ease-in-out;
}
.tabs__nav-icon {
  margin-right: 0.5rem;
  color: rgba(0, 0, 0, 0.38);
}
.tabs__nav-item:hover {
  border-color: #ccc;
}
.tabs__nav-item.router-link-exact-active {
  border-color: var(--green);
  font-weight: 600;
}
&amp;amp;lt;/style&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Following the &lt;a href="https://getbem.com/naming/"&gt;BEM naming convention&lt;/a&gt; is easier with SCSS, but I didn't want to fill the example with extra dependencies.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our tab component looks pretty slick now: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3LfxBnHl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2_739c51d2d1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3LfxBnHl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2_739c51d2d1.gif" alt="Vite-App-2.gif" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Instead of a conclusion
&lt;/h2&gt;

&lt;p&gt;Now, I encourage you to give it a try, explore further customizations, and share your experiences and improvements with &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;our community&lt;/a&gt;. Let's continue building more efficient and elegant applications together!&lt;/p&gt;

</description>
      <category>vue</category>
      <category>ui</category>
    </item>
    <item>
      <title>Lessons we learned while building a stateful Kafka connector and tips for creating yours</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Wed, 03 May 2023 20:16:54 +0000</pubDate>
      <link>https://dev.to/bytewax/lessons-we-learned-while-building-a-stateful-kafka-connector-and-tips-for-creating-yours-157b</link>
      <guid>https://dev.to/bytewax/lessons-we-learned-while-building-a-stateful-kafka-connector-and-tips-for-creating-yours-157b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3zQ82qXy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjksww988bqqyata57p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3zQ82qXy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjksww988bqqyata57p8.png" alt="Bytewax" width="798" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Bytewax framework is a flexible tool designed to meet the challenges Python developers face in today's data-driven world. It aims to provide seamless integrations and time-saving shortcuts for data engineers dealing with streaming data, making their work more efficient and effective. An important part of developing Bytewax is input connectors: the components that connect external systems to Bytewax so that users can import data from them.&lt;/p&gt;

&lt;p&gt;Here we're going to show how to write a custom input connector by walking through how we wrote &lt;a href="https://github.com/bytewax/bytewax/blob/5d5ec04851c2e254cf1aaf429f4890be3a3ce070/pysrc/bytewax/connectors/kafka.py"&gt;our built-in Kafka input connector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Writing input connectors for arbitrary systems while supporting failure recovery and strong delivery guarantees requires a solid understanding of how recovery works both inside Bytewax and in the connected system. We strongly encourage you to use the connectors built into &lt;a href="https://bytewax.io/apidocs/bytewax.connectors/index"&gt;&lt;code&gt;bytewax.connectors&lt;/code&gt;&lt;/a&gt; if possible, and to read the documentation on their limits.&lt;/p&gt;

&lt;p&gt;If you are interested in writing your own, this article can give you an introduction into some of the decisions involved in writing an input connector for an ordered, partitioned input stream.&lt;/p&gt;

&lt;p&gt;If you need any help at all writing a connector, come say "hi" and ask questions in &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;the Bytewax community Slack&lt;/a&gt;! We are happy to help!&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitions
&lt;/h2&gt;

&lt;p&gt;Subclassing &lt;a href="https://bytewax.io/apidocs/bytewax.inputs#bytewax.inputs.PartitionedInput"&gt;&lt;code&gt;bytewax.inputs.PartitionedInput&lt;/code&gt;&lt;/a&gt; is the core API for writing an input connector when your input has a fixed number of &lt;strong&gt;partitions&lt;/strong&gt;. A partition is a "sub-stream" of data that can be read concurrently and independently.&lt;/p&gt;

&lt;p&gt;To write a &lt;code&gt;PartitionedInput&lt;/code&gt; subclass, you need to answer three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many partitions are there?&lt;/li&gt;
&lt;li&gt;How can I build a source that reads a single partition?&lt;/li&gt;
&lt;li&gt;How can I rewind a partition and read from a specific item?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are answered via the abstract methods &lt;code&gt;list_parts&lt;/code&gt; and &lt;code&gt;build_part&lt;/code&gt;, and &lt;code&gt;build_part&lt;/code&gt;'s &lt;code&gt;resume_state&lt;/code&gt; argument, respectively.&lt;/p&gt;
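To make the contract concrete, here is a toy in-memory version of the same three answers in plain Python (illustrative only; the real base classes live in `bytewax.inputs` and have stricter signatures):

```python
# Toy sketch of the PartitionedInput contract, not the real bytewax classes.
class ToyPartitionedInput:
    def __init__(self, streams):
        # streams: dict mapping partition ID -> list of items
        self._streams = streams

    def list_parts(self):
        # Question 1: how many partitions are there?
        return set(self._streams.keys())

    def build_part(self, for_part, resume_state):
        # Questions 2 and 3: build a reader for one partition,
        # rewound to the snapshotted position if one exists.
        start = 0 if resume_state is None else resume_state
        return ToySource(self._streams[for_part], start)


class ToySource:
    def __init__(self, items, start):
        self._items = items
        self._idx = start

    def next(self):
        if self._idx >= len(self._items):
            raise StopIteration()
        item = self._items[self._idx]
        self._idx += 1
        return item

    def snapshot(self):
        # The position to resume from: the next unread index.
        return self._idx
```

Resuming with a snapshot value picks up exactly where the previous reader left off, which is the behavior the real connector must provide.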

&lt;p&gt;We're going to use the &lt;a href="https://github.com/confluentinc/confluent-kafka-python"&gt;&lt;code&gt;confluent-kafka&lt;/code&gt;&lt;/a&gt; package to actually communicate with the Kafka cluster. Let's import all the things we'll need for this input source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Dict, Iterable

from confluent_kafka import (
    Consumer,
    KafkaError,
    OFFSET_BEGINNING,
    TopicPartition,
)
from confluent_kafka.admin import AdminClient

from bytewax.inputs import PartitionedInput, StatefulSource

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our KafkaInput connector is going to read from a specific set of topics on a cluster. First, let's define our class and write a constructor that takes all the arguments that make sense for configuring this specific kind of input source. This is going to be the public entry point to this connector, and is what you'll pass to the &lt;a href="https://bytewax.io/apidocs/bytewax.dataflow#bytewax.dataflow.Dataflow.input"&gt;&lt;code&gt;bytewax.dataflow.Dataflow.input&lt;/code&gt;&lt;/a&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class KafkaInput(PartitionedInput):
    def __init__ (
        self,
        brokers: Iterable[str],
        topics: Iterable[str],
        tail: bool = True,
        starting_offset: int = OFFSET_BEGINNING,
        add_config: Dict[str, str] = None,
    ):
        add_config = add_config or {}

        if isinstance(brokers, str):
            raise TypeError(&amp;amp;quot;brokers must be an iterable and not a string&amp;amp;quot;)
        self

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Listing Partitions
&lt;/h3&gt;

&lt;p&gt;Next, let's answer question one: how many partitions are there? Conveniently, &lt;code&gt;confluent-kafka&lt;/code&gt; provides &lt;a href="https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.admin.AdminClient.list_topics"&gt;&lt;code&gt;AdminClient.list_topics&lt;/code&gt;&lt;/a&gt;, which gives you the partition count of each topic, packed deep in a metadata object. The signature of &lt;code&gt;PartitionedInput.list_parts&lt;/code&gt; says it must return a set of strings with the IDs of all the partitions. Let's build the &lt;code&gt;AdminClient&lt;/code&gt; using our stored configuration instance variables and then delegate to a &lt;code&gt;_list_parts&lt;/code&gt; function so we can re-use it if necessary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class KafkaInput(PartitionedInput):
    def list_parts(self):
        config = {
            &amp;amp;quot;bootstrap.servers&amp;amp;quot;: &amp;amp;quot;,&amp;amp;quot;.join(self._brokers),
        }
        config.update(self._add_config)
        client = AdminClient(config)

        return set(_list_parts(client, self._topics))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function unpacks the nested metadata returned from &lt;code&gt;AdminClient.list_topics&lt;/code&gt; and yields strings like "3-my_topic" for partition index 3 of the topic &lt;code&gt;my_topic&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _list_parts(client, topics):
    for topic in topics:
        # List topics one-by-one so if auto-create is turned on,
        # we respect that.
        cluster_metadata = client.list_topics(topic)
        topic_metadata = cluster_metadata.topics[topic]
        if topic_metadata.error is not None:
            raise RuntimeError(
                f&amp;amp;quot;error listing partitions for Kafka topic `{topic!r}`: &amp;amp;quot;
                f&amp;amp;quot;{topic_metadata.error.str()}&amp;amp;quot;
            )
        part_idxs = topic_metadata.partitions.keys()
        for i in part_idxs:
            yield f&amp;amp;quot;{i}-{topic}&amp;amp;quot;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How do you decide what the partition ID string should be? It should be something that globally identifies this partition, hence combining partition number and topic name.&lt;/p&gt;
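Because the index comes first and parsing uses a maxsplit of 1, the scheme stays unambiguous even when topic names themselves contain dashes. A quick sketch of the round trip (helper names are illustrative):

```python
def make_part_id(part_idx, topic):
    # Index first, so a topic name containing "-" stays intact.
    return f"{part_idx}-{topic}"


def parse_part_id(part_id):
    # maxsplit=1 splits only on the first dash, preserving the topic.
    part_idx, topic = part_id.split("-", 1)
    return int(part_idx), topic
```

This is the same parsing `build_part` performs below when it receives a partition ID back from the runtime.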

&lt;p&gt;&lt;code&gt;PartitionedInput.list_parts&lt;/code&gt; might be called multiple times from multiple workers as a Bytewax cluster is set up and resumed, so it must return exactly the same set of partitions on every call in order to work correctly. Changing the number of partitions is not currently supported with recovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Partitions
&lt;/h3&gt;

&lt;p&gt;Next, let's answer question two: how can I build a source that reads a single partition? We can use &lt;code&gt;confluent-kafka&lt;/code&gt;'s &lt;a href="https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#consumer"&gt;&lt;code&gt;Consumer&lt;/code&gt;&lt;/a&gt; to make a Kafka consumer that will read a specific topic and partition starting from an offset. The signature of &lt;code&gt;PartitionedInput.build_part&lt;/code&gt; takes a specific partition ID (we'll ignore the resume state for now) and must return a stateful source.&lt;/p&gt;

&lt;p&gt;We parse the partition ID to determine which Kafka partition we should be consuming from. (Hence the importance of having a globally unique partition ID.) Then we build a &lt;code&gt;Consumer&lt;/code&gt; that connects to the Kafka cluster, and build our custom &lt;code&gt;_KafkaSource&lt;/code&gt; stateful source. That is where the actual reading of input items happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class KafkaInput(PartitionedInput):
    def build_part(self, for_part, resume_state):
        part_idx, topic = for_part.split(&amp;amp;quot;-&amp;amp;quot;, 1)
        part_idx = int(part_idx)
        assert topic in self._topics, &amp;amp;quot;Can&amp;amp;apos;t resume from different set of Kafka topics&amp;amp;quot;

        config = {
            # We&amp;amp;apos;ll manage our own &amp;amp;quot;consumer group&amp;amp;quot; via recovery
            # system.
            &amp;amp;quot;group.id&amp;amp;quot;: &amp;amp;quot;BYTEWAX_IGNORED&amp;amp;quot;,
            &amp;amp;quot;enable.auto.commit&amp;amp;quot;: &amp;amp;quot;false&amp;amp;quot;,
            &amp;amp;quot;bootstrap.servers&amp;amp;quot;: &amp;amp;quot;,&amp;amp;quot;.join(self._brokers),
            &amp;amp;quot;enable.partition.eof&amp;amp;quot;: str(not self._tail),
        }
        config.update(self._add_config)
        consumer = Consumer(config)
        return _KafkaSource(
            consumer, topic, part_idx, self._starting_offset, resume_state
        )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stateful Input Source
&lt;/h2&gt;

&lt;p&gt;What is a stateful source? It is defined by subclassing &lt;a href="https://bytewax.io/apidocs/bytewax.inputs#bytewax.inputs.StatefulSource"&gt;&lt;code&gt;bytewax.inputs.StatefulSource&lt;/code&gt;&lt;/a&gt;. You can think of it as a "snapshot-able Python iterator": something that produces a stream of items via &lt;code&gt;StatefulSource.next&lt;/code&gt; and also lets the Bytewax runtime ask for a snapshot of the source's position via &lt;code&gt;StatefulSource.snapshot&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;_KafkaSource&lt;/code&gt; is going to read items from a specific Kafka topic's partition. Let's define that class and have a constructor that takes in all the details to start reading that partition: the consumer (already configured to connect to the correct Kafka cluster), the topic, the specific partition index, the default starting offset (beginning or end of the topic), and again we'll ignore the resume state for just another moment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class _KafkaSource(StatefulSource):
    def __init__(self, consumer, topic, part_idx, starting_offset, resume_state):
        self._offset = resume_state or starting_offset
        # Assign does not activate consumer grouping.
        consumer.assign([TopicPartition(topic, part_idx, self._offset)])
        self._consumer = consumer
        self._topic = topic

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beating heart of the input source is the &lt;code&gt;StatefulSource.next&lt;/code&gt; method. It is periodically called by Bytewax and behaves similarly to a &lt;a href="https://docs.python.org/3/library/stdtypes.html#iterator.__next__"&gt;built-in Python iterator's &lt;code&gt;__next__&lt;/code&gt; method&lt;/a&gt;. It must do one of three things: return a new item to send into the dataflow, return &lt;code&gt;None&lt;/code&gt; to signal that there is no data currently but there might be later, or raise &lt;code&gt;StopIteration&lt;/code&gt; when the partition is complete.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Consumer.poll&lt;/code&gt; lets us ask whether there are any new messages on the partition we set this consumer up to follow. If there are, we unpack the message and return it; otherwise we handle the no-data case, the end-of-stream case, or an exceptional error case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class _KafkaSource(StatefulSource):
    def next(self):
        msg = self._consumer.poll(0.001) # seconds
        if msg is None:
            return
        elif msg.error() is not None:
            if msg.error().code() == KafkaError._PARTITION_EOF:
                raise StopIteration()
            else:
                raise RuntimeError(
                    f&amp;amp;quot;error consuming from Kafka topic `{self._topic!r}`: {msg.error()}&amp;amp;quot;
                )
        else:
            item = (msg.key(), msg.value())
            # Resume reading from the next message, not this one.
            self._offset = msg.offset() + 1
            return item

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An important thing to note here is that &lt;code&gt;StatefulSource.next&lt;/code&gt; must never block. The Bytewax runtime employs a sort of cooperative multitasking, so each operator must return quickly, even if it has nothing to do, so that other operators in the dataflow that do have work can run. Unfortunately, there is currently no way in the Bytewax API to prevent polling of input sources (since input comes from outside the dataflow, Bytewax has no way of knowing when more data is available, so it must check constantly). The best practice is to pause briefly when there is no data, to prevent a full spin-loop, but not so long that you block other operators from doing their work.&lt;/p&gt;
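The same cooperative pattern can be sketched with a plain in-memory queue standing in for the Kafka consumer (illustrative; this is not the `confluent-kafka` API):

```python
import queue


class NonBlockingSource:
    """Sketch of a cooperative `next`: waits at most ~1 ms, never blocks."""

    def __init__(self):
        self._q = queue.Queue()

    def next(self):
        try:
            # A tiny timeout avoids a busy spin-loop when idle while
            # still returning control to the runtime almost immediately.
            return self._q.get(timeout=0.001)
        except queue.Empty:
            return None  # no data right now; the runtime will call again
```

The `Consumer.poll(0.001)` call above plays exactly this role: a bounded, near-instant wait.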

&lt;p&gt;There is also a &lt;code&gt;StatefulSource.close&lt;/code&gt; method which lets you perform a well-behaved shutdown when EOF is reached. It is not guaranteed to be called in a failure situation, so the connected system must not depend on it. In this case, &lt;code&gt;Consumer.close&lt;/code&gt; performs the graceful shutdown.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# class _KafkaSource(StatefulSource):
    def close(self):
        self._consumer.close()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resume State
&lt;/h3&gt;

&lt;p&gt;Let's explain how failure recovery works for input connectors. Bytewax's recovery system allows a dataflow to quickly resume processing and producing output without replaying all input. It does this by periodically snapshotting all internal state, input positions, and output positions of the dataflow. When it needs to recover after a failure, it loads all state from a recent snapshot, then re-plays input items in the same order starting from the instant of that snapshot, overwriting the corresponding output items. This causes the state and output of the dataflow to evolve in the same way during the resumed execution as during the previous one.&lt;/p&gt;

&lt;h4&gt;
  
  
  Snapshotting
&lt;/h4&gt;

&lt;p&gt;So, we need to keep track of the current position somewhere in each partition. Kafka has the concept of message offsets: immutable, incrementing integers marking the position of each message. In &lt;code&gt;_KafkaSource.next&lt;/code&gt;, we kept track of the offset of the next message the partition will read via &lt;code&gt;self._offset = msg.offset() + 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Bytewax calls &lt;code&gt;StatefulSource.snapshot&lt;/code&gt; when it needs to record that partition's position; ours simply returns the internally stored next-message offset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class _KafkaSource(StatefulSource):
    def snapshot(self):
        return self._offset

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Resume
&lt;/h4&gt;

&lt;p&gt;On resume after a failure, Bytewax's recovery machinery does the hard work of collecting all the snapshots, finding the ones that represent a coherent set of states across the previous execution's cluster, and threading each bit of snapshot data back into &lt;code&gt;PartitionedInput.build_part&lt;/code&gt; for the same partition. To take advantage of that, your resulting partition must resume reading from the exact spot represented by that snapshot.&lt;/p&gt;

&lt;p&gt;Since we were storing the Kafka message offset of the next message to be read in &lt;code&gt;_KafkaSource._offset&lt;/code&gt;, we need to ensure we thread that message offset back into the &lt;code&gt;Consumer&lt;/code&gt; when it is built. That happens by passing &lt;code&gt;resume_state&lt;/code&gt; into the &lt;code&gt;_KafkaSource&lt;/code&gt; constructor, which assigns the consumer to start reading from that offset. Looking at that code again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class _KafkaSource(StatefulSource):
# def __init__ (self, consumer, topic, part_idx, starting_offset, resume_state):
        self._offset = resume_state or starting_offset
        # Assign does not activate consumer grouping.
        consumer.assign([TopicPartition(topic, part_idx, self._offset)])
        ...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As one extra wrinkle, if there is no resume state for a partition because it is being built for the first time, &lt;code&gt;None&lt;/code&gt; will be passed for &lt;code&gt;resume_state&lt;/code&gt; to &lt;code&gt;PartitionedInput.build_part&lt;/code&gt;. In that case, we need to fall back to the requested default starting offset: either the beginning or the end of the topic. When we do have resume state, we must ignore that default, since we need to start from the specific snapshotted offset to uphold the recovery contract.&lt;/p&gt;
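One subtlety worth hedging against when applying that default: a resume offset of 0 is falsy in Python, so a bare `or` can silently discard it. An explicit `None` check is safer (a sketch, using an assumed stand-in sentinel rather than the real `confluent_kafka` constant):

```python
OFFSET_BEGINNING = -2  # stand-in sentinel for illustration only


def resolve_offset(resume_state, starting_offset):
    # `resume_state or starting_offset` would wrongly fall back to the
    # default when resuming from offset 0, because 0 is falsy.
    return starting_offset if resume_state is None else resume_state
```

Only `None` means "no snapshot"; any integer, including 0, is a real resume position.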

&lt;h2&gt;
  
  
  Delivery Guarantees
&lt;/h2&gt;

&lt;p&gt;Let's talk for a moment about how this recovery model with snapshots impacts delivery guarantees. A well-designed input connector on its own can only guarantee that the output of a dataflow to a downstream system is at-least-once: the recovery system will ensure that we replay any input that might not have produced output before the execution cluster failed, but it requires coordination with the output connector (via something like transactions or two-phase commits) to ensure that the replay does not result in duplicated writes downstream, which is what exactly-once processing requires.&lt;/p&gt;
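A toy replay makes the at-least-once behavior concrete (a deliberately simplified model, not Bytewax internals): a snapshot records the resume position after the first item, a crash occurs after the second item is written, and resuming replays input from the snapshot position:

```python
source = ["a", "b", "c", "d"]
sink = []

# First execution: a snapshot records resume position 1 (after "a"),
# then the cluster crashes right after writing "b" to the sink.
sink.extend(["a", "b"])
resume_position = 1

# Resumed execution: replay input from the snapshot position.
# Without transactional output, "b" reaches the sink a second time.
for item in source[resume_position:]:
    sink.append(item)

# sink == ["a", "b", "b", "c", "d"]: "b" was delivered at least once -- here, twice.
```

A transactional sink would instead roll back the un-snapshotted "b" before the replay, de-duplicating the writes.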

&lt;h3&gt;
  
  
  Non-Replay-Able Sources
&lt;/h3&gt;

&lt;p&gt;If your input source does not have the ability to replay old data, you can still use it with Bytewax, but your delivery guarantees are limited to at-least-once. For example, when listening to an ephemeral SSE or WebSocket stream, you can always start listening, but often the API does not let you request a replay of missed events. When Bytewax attempts to resume, all the other operators will have their internal state returned to the last coherent snapshot, but since the input sources do not rewind, the dataflow will appear to have missed all input between when that snapshot was taken and the resume.&lt;/p&gt;

&lt;p&gt;In this case, your &lt;code&gt;StatefulSource.snapshot&lt;/code&gt; can return &lt;code&gt;None&lt;/code&gt; and no recovery data will be saved. You can then ignore the &lt;code&gt;resume_state&lt;/code&gt; argument of &lt;code&gt;PartitionedInput.build_part&lt;/code&gt; because it will always be &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;
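A minimal sketch of such a source (the live stream here is just an iterator standing in for, say, a WebSocket connection; names are illustrative):

```python
class EphemeralSource:
    """Sketch of a source that cannot rewind, e.g. a live event feed."""

    def __init__(self, live_stream):
        self._stream = iter(live_stream)

    def next(self):
        # Hand back the next live event; a real implementation would
        # return None when nothing has arrived yet instead of blocking.
        return next(self._stream)

    def snapshot(self):
        # Nothing to save: the stream cannot be rewound, so return None
        # and no recovery data will be stored for this partition.
        return None
```

On resume, `build_part` simply starts listening again from "now", accepting the gap.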

</description>
      <category>kafka</category>
      <category>connectors</category>
    </item>
    <item>
      <title>How We Detect Anomalies In Our AWS Infrastructure (And Have Peaceful Nights)</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Tue, 02 May 2023 18:50:54 +0000</pubDate>
      <link>https://dev.to/bytewax/how-we-detect-anomalies-in-our-aws-infrastructure-and-have-peaceful-nights-19k1</link>
      <guid>https://dev.to/bytewax/how-we-detect-anomalies-in-our-aws-infrastructure-and-have-peaceful-nights-19k1</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mE8HEAOX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d85qqusgtuorj5buicp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mE8HEAOX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d85qqusgtuorj5buicp0.png" alt="Post image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Everyone who uses a cloud provider wants to monitor their systems to detect usage anomalies. We run some internal data services, our website/blog, and a few demo clusters on AWS, and we wanted a low-maintenance way to monitor the infrastructure for issues, so we took the opportunity to dogfood Bytewax, of course :).&lt;/p&gt;

&lt;p&gt;In this blog post, we will walk you through the process of building a cloud-based anomaly detection system using Bytewax, Redpanda, and Amazon Web Services (AWS). Our goal is to create a dataflow that detects anomalies in EC2 instance CPU utilization. To achieve this, we will collect usage data from AWS CloudWatch using &lt;a href="https://www.elastic.co/logstash/"&gt;Logstash&lt;/a&gt; and store it using &lt;a href="https://redpanda.com/"&gt;Redpanda&lt;/a&gt;, a Kafka-compatible streaming data platform. Finally, we will use Bytewax, a Python stream processor, to build our anomaly detection system.&lt;/p&gt;

&lt;p&gt;This is exactly the same infrastructure we use internally at Bytewax and, in fact, we haven't touched it for months!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Required Infrastructure on AWS
&lt;/h2&gt;

&lt;p&gt;Before we begin, ensure that you have the following prerequisites set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI configured with admin access&lt;/li&gt;
&lt;li&gt;Helm&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;A Kubernetes cluster running in AWS and kubectl configured to access it&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Configuring Kubernetes and Redpanda
&lt;/h3&gt;

&lt;p&gt;In this section, we will configure Kubernetes and Redpanda using the provided code snippets. Make sure you have a running Kubernetes cluster in AWS and kubectl configured to access it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set up a namespace
&lt;/h3&gt;

&lt;p&gt;Create a new namespace for Redpanda and set it as the active context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create ns redpanda-bytewax


kubectl config set-context --current --namespace=redpanda-bytewax

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Install Cert-Manager and Redpanda Operator
&lt;/h3&gt;

&lt;p&gt;The Redpanda operator requires cert-manager to create certificates for TLS communication. To install cert-manager with Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add jetstack https://charts.jetstack.io &amp;amp;amp;&amp;amp;amp; \
helm repo update &amp;amp;amp;&amp;amp;amp; \
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.4.4 \
  --set installCRDs=true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fetch the latest Redpanda Operator version, add the Redpanda Helm repo, and install the Redpanda Operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export VERSION=$(curl -s https://api.github.com/repos/redpanda-data/redpanda/releases/latest | jq -r .tag_name)


helm repo add redpanda https://charts.vectorized.io/ &amp;amp;amp;&amp;amp;amp; helm repo update


kubectl apply -k https://github.com/redpanda-data/redpanda/src/go/k8s/config/crd?ref=$VERSION


helm install redpanda-operator redpanda/redpanda-operator --namespace redpanda-system --create-namespace --version $VERSION

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Create Redpanda cluster
&lt;/h3&gt;

&lt;p&gt;Save the following YAML configuration in a file named &lt;code&gt;3_node_cluster.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: three-node-cluster
spec:
  image: &amp;amp;quot;vectorized/redpanda&amp;amp;quot;
  version: &amp;amp;quot;latest&amp;amp;quot;
  replicas: 3
  resources:
    requests:
      cpu: 1
      memory: 1.2Gi
    limits:
      cpu: 1
      memory: 1.2Gi
  configuration:
    rpcServer:
      port: 33145
    kafkaApi:
    - port: 9092
    pandaproxyApi:
    - port: 8082
    schemaRegistry:
      port: 8081
    adminApi:
    - port: 9644
    developerMode: true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the Redpanda cluster configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f ./3_node_cluster.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the status of Redpanda pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get po -lapp.kubernetes.io/component=redpanda

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export the broker addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export BROKERS=`kubectl get clusters three-node-cluster -o=jsonpath=&amp;amp;apos;{.status.nodes.internal}&amp;amp;apos; | jq -r &amp;amp;apos;join(&amp;amp;quot;,&amp;amp;quot;)&amp;amp;apos;`

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Set up topics
&lt;/h3&gt;

&lt;p&gt;Run an rpk container to create and manage topics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run rpk-shell --rm -i --tty --image vectorized/redpanda --command /bin/bash

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the rpk terminal, export the broker addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export BROKERS=three-node-cluster-0.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-1.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-2.three-node-cluster.redpanda-bytewax.svc.cluster.local.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;View the cluster information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS cluster info

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create two topics with 5 partitions each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic create ec2_metrics -p 5


rpk --brokers $BROKERS topic create ec2_metrics_anomalies -p 5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List the topics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic list

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consume messages from the &lt;code&gt;ec2_metrics&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic consume ec2_metrics -o start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Exporting CloudWatch EC2 Metrics to our Redpanda Cluster with Logstash
&lt;/h2&gt;

&lt;p&gt;Logstash is an open-source data processing pipeline that can ingest data from multiple sources, transform it, and send it to various destinations, such as Redpanda. In this case, we'll use Logstash to collect EC2 metrics from CloudWatch and send them to our Redpanda cluster for further processing.&lt;/p&gt;
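As a rough sketch of where this section is headed, the pipeline pairs the `logstash-input-cloudwatch` plugin with the Kafka output plugin pointed at the Redpanda brokers. The plugin names and option keys are real Logstash options, but the specific values here are illustrative assumptions:

```conf
input {
  cloudwatch {
    namespace => "AWS/EC2"            # pull EC2 metrics
    metrics   => ["CPUUtilization"]   # the metric we detect anomalies on
    region    => "us-west-2"          # illustrative region
    interval  => 300                  # seconds between CloudWatch polls
  }
}

output {
  kafka {
    bootstrap_servers => "..."        # the Redpanda broker addresses
    topic_id          => "ec2_metrics"
    codec             => json
  }
}
```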

&lt;h4&gt;
  
  
  Logstash Permissions
&lt;/h4&gt;

&lt;p&gt;First, we need to create an AWS policy and user with the required permissions for Logstash to access CloudWatch and EC2. Save the following JSON configuration in a file named &lt;code&gt;cloudwatch-logstash-policy.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    &amp;amp;quot;Version&amp;amp;quot;: &amp;amp;quot;2012-10-17&amp;amp;quot;,
    &amp;amp;quot;Statement&amp;amp;quot;: [
        {
            &amp;amp;quot;Sid&amp;amp;quot;: &amp;amp;quot;Stmt1444715676000&amp;amp;quot;,
            &amp;amp;quot;Effect&amp;amp;quot;: &amp;amp;quot;Allow&amp;amp;quot;,
            &amp;amp;quot;Action&amp;amp;quot;: [
                &amp;amp;quot;cloudwatch:GetMetricStatistics&amp;amp;quot;,
                &amp;amp;quot;cloudwatch:ListMetrics&amp;amp;quot;
            ],
            &amp;amp;quot;Resource&amp;amp;quot;: &amp;amp;quot;*&amp;amp;quot;
        },
        {
            &amp;amp;quot;Sid&amp;amp;quot;: &amp;amp;quot;Stmt1444716576170&amp;amp;quot;,
            &amp;amp;quot;Effect&amp;amp;quot;: &amp;amp;quot;Allow&amp;amp;quot;,
            &amp;amp;quot;Action&amp;amp;quot;: [
                &amp;amp;quot;ec2:DescribeInstances&amp;amp;quot;
            ],
            &amp;amp;quot;Resource&amp;amp;quot;: &amp;amp;quot;*&amp;amp;quot;
        }
    ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can create the policy and user, and attach the policy to the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws iam create-policy --policy-name CloudwatchLogstash --policy-document file://cloudwatch-logstash-policy.json
aws iam create-user --user-name logstash-user


export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query &amp;amp;quot;Account&amp;amp;quot; --output text)


aws iam attach-user-policy --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/CloudwatchLogstash --user-name logstash-user

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To provide access, we can create Kubernetes secrets for the AWS access key and secret access key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create secret generic aws-secret-access-key --from-literal=value=$(aws iam create-access-key --user-name logstash-user | jq -r .AccessKey.SecretAccessKey)


kubectl create secret generic aws-access-key-id --from-literal=value=$(aws iam list-access-keys --user-name logstash-user --query &amp;amp;quot;AccessKeyMetadata[0].AccessKeyId&amp;amp;quot; --output text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can create an Amazon Elastic Container Registry (ECR) repository to store the custom Logstash image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ecr create-repository --repository-name redpanda-bytewax


export REPOSITORY_URI=$(aws ecr describe-repositories --repository-names redpanda-bytewax --output text --query &amp;amp;quot;repositories[0].repositoryUri&amp;amp;quot;)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we build a Logstash image with the CloudWatch input plugin installed. Create a Dockerfile named &lt;code&gt;logstash-Dockerfile&lt;/code&gt; that installs the plugin in a &lt;code&gt;RUN&lt;/code&gt; step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM docker.elastic.co/logstash/logstash:7.17.3
RUN bin/logstash-plugin install logstash-input-cloudwatch

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we build and push the Logstash image to the ECR repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -f logstash-Dockerfile -t $REPOSITORY_URI:\logstash-cloudwatch .


export AWS_REGION=us-west-2


aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com


docker push $REPOSITORY_URI:logstash-cloudwatch

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy Logstash on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Now that we have our custom Logstash image, we will deploy it on Kubernetes using the Helm chart provided by Elastic. First, we need to gather some information and create a &lt;code&gt;logstash-values.yaml&lt;/code&gt; file with the necessary configuration.&lt;/p&gt;

&lt;p&gt;Run the following commands to obtain the required information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo $REPOSITORY_URI


echo $AWS_REGION


echo $BROKERS | sed -e &amp;amp;apos;s/local\./local\:9092/g&amp;amp;apos;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;logstash-values.yaml&lt;/code&gt; file and replace the placeholders (shown with &lt;code&gt;&amp;amp;lt;&amp;amp;gt;&lt;/code&gt;) with the information obtained above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;image: &amp;amp;quot;&amp;amp;lt;YOUR REPOSITORY URI&amp;amp;gt;&amp;amp;quot;
imageTag: &amp;amp;quot;logstash-cloudwatch&amp;amp;quot;
imagePullPolicy: &amp;amp;quot;Always&amp;amp;quot;

persistence:
  enabled: true

logstashConfig:
  logstash.yml: |
    http.host: 0.0.0.0
    xpack.monitoring.enabled: false

logstashPipeline:
  uptime.conf: |
    input {
      cloudwatch {
        namespace =&amp;amp;gt; &amp;amp;quot;AWS/EC2&amp;amp;quot;
        metrics =&amp;amp;gt; [&amp;amp;quot;CPUUtilization&amp;amp;quot;]
        region =&amp;amp;gt; &amp;amp;quot;&amp;amp;lt;YOUR AWS REGION&amp;amp;gt;&amp;amp;quot;
        interval =&amp;amp;gt; 300
        period =&amp;amp;gt; 300
      }       
    }
    filter {
      mutate {
        add_field =&amp;amp;gt; {
          &amp;amp;quot;[index]&amp;amp;quot; =&amp;amp;gt; &amp;amp;quot;0&amp;amp;quot;
          &amp;amp;quot;[value]&amp;amp;quot; =&amp;amp;gt; &amp;amp;quot;%{maximum}&amp;amp;quot;
          &amp;amp;quot;[instance]&amp;amp;quot; =&amp;amp;gt; &amp;amp;quot;%{InstanceId}&amp;amp;quot;                      
        }
      }
    }
    output {
        kafka {
          bootstrap_servers =&amp;amp;gt; &amp;amp;quot;&amp;amp;lt;YOUR REDPANDA BROKERS&amp;amp;gt;&amp;amp;quot;
          topic_id =&amp;amp;gt; &amp;amp;apos;ec2_metrics&amp;amp;apos;
          codec =&amp;amp;gt; json
        }
    }

extraEnvs:
  - name: &amp;amp;apos;AWS_ACCESS_KEY_ID&amp;amp;apos;
    valueFrom:
      secretKeyRef:
        name: aws-access-key-id
        key: value
  - name: &amp;amp;apos;AWS_SECRET_ACCESS_KEY&amp;amp;apos;
    valueFrom:
      secretKeyRef:
        name: aws-secret-access-key
        key: value

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;logstash-values.yaml&lt;/code&gt; file ready, install the Logstash Helm chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install logstash elastic/logstash -f logstash-values.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify that Logstash is exporting the EC2 metrics to the Redpanda cluster, open a terminal with rpk and consume the &lt;code&gt;ec2_metrics&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic consume ec2_metrics -o start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;CTRL-C&lt;/code&gt; to quit the rpk terminal when you're done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Dataflow to Detect Anomalies with Bytewax
&lt;/h2&gt;

&lt;p&gt;With our infrastructure in place, it's time to build a dataflow to detect anomalies. We will use Bytewax and &lt;a href="https://www.bytewax.io/docs/deployment/waxctl"&gt;Waxctl&lt;/a&gt; to define and deploy a dataflow that processes the EC2 instance CPU utilization data stored in the Redpanda cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anomaly Detection with Half Space Trees
&lt;/h3&gt;

&lt;p&gt;Half Space Trees (HST) is an unsupervised machine learning algorithm used for detecting anomalies in streaming data. The algorithm is designed to efficiently handle high-dimensional and high-velocity data streams. HST builds a set of binary trees to partition the feature space into half spaces, where each tree captures a different view of the data. By observing the frequency of points falling into each half space, the algorithm can identify regions that are less dense than others, suggesting that data points within those regions are potential anomalies.&lt;/p&gt;
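&lt;p&gt;To make that density intuition concrete, here is a toy, pure-Python sketch: random split points partition the value range into half spaces, we count how many reference points land on each side, and a new point is scored by how sparsely populated its half spaces are. This is only an illustration of the intuition, not River's HST implementation (real HST builds full binary trees over sliding windows of the stream):&lt;/p&gt;

```python
import random

# Toy sketch of the density intuition behind Half Space Trees (HST).
# NOT river's implementation: we use single random 1-D split points
# instead of full binary trees, purely to show how sparse regions
# earn high anomaly scores.

random.seed(42)

# Each "tree" here is just one random split of the unit interval.
splits = [random.random() for _ in range(25)]

def fit_counts(values, splits):
    """Count how many reference points fall on each side of each split."""
    counts = []
    for s in splits:
        left = sum(1 for v in values if v < s)
        counts.append((left, len(values) - left))
    return counts

def score(x, splits, counts, n):
    """Average rarity of the half spaces x falls into (higher = more anomalous)."""
    total = 0.0
    for s, (left, right) in zip(splits, counts):
        mass = left if x < s else right
        total += 1.0 - mass / n
    return total / len(splits)

# Dense cluster of "normal" CPU readings around 20%, then score two points.
normal = [random.gauss(0.2, 0.03) for _ in range(500)]
counts = fit_counts(normal, splits)

print(score(0.2, splits, counts, len(normal)))   # low: dense region
print(score(0.95, splits, counts, len(normal)))  # high: sparse region
```

&lt;p&gt;A reading near the dense cluster scores low, while an outlying reading scores high, which is exactly the property we rely on to flag anomalous CPU usage.&lt;/p&gt;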

&lt;p&gt;In our case, we will use HST to detect anomalous CPU usage in EC2 metrics. We'll leverage the Python library River, which provides an implementation of the HST algorithm, and Bytewax, a platform for creating data processing pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the Dataflow for Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;To create our dataflow, we'll first import the necessary libraries and set up Kafka connections. The following code snippet demonstrates how to create a dataflow with River and Bytewax to consume EC2 metrics from Kafka and detect anomalous CPU usage using HST:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os
import datetime as dt
from pathlib import Path

from bytewax.connectors.kafka import KafkaInput, KafkaOutput
from bytewax.dataflow import Dataflow
from bytewax.recovery import SqliteRecoveryConfig

from river import anomaly

kafka_servers = os.getenv(&amp;amp;quot;BYTEWAX_KAFKA_SERVER&amp;amp;quot;, &amp;amp;quot;localhost:9092&amp;amp;quot;)
kafka_topic = os.getenv(&amp;amp;quot;BYTEWAX_KAFKA_TOPIC&amp;amp;quot;, &amp;amp;quot;ec2_metrics&amp;amp;quot;)
kafka_output_topic = os.getenv(&amp;amp;quot;BYTEWAX_KAFKA_OUTPUT_TOPIC&amp;amp;quot;, &amp;amp;quot;ec2_metrics_anomalies&amp;amp;quot;)

# Define the dataflow object and kafka input.
flow = Dataflow()
flow.input(&amp;amp;quot;inp&amp;amp;quot;, KafkaInput(kafka_servers.split(&amp;amp;quot;,&amp;amp;quot;), [kafka_topic]))

# convert to percentages and group by instance id
def group_instance_and_normalize(key__data):
    _, data = key__data
    data = json.loads(data)
    data[&amp;amp;quot;value&amp;amp;quot;] = float(data[&amp;amp;quot;value&amp;amp;quot;]) / 100
    return data[&amp;amp;quot;instance&amp;amp;quot;], data

flow.map(group_instance_and_normalize)
# (&amp;amp;quot;c6585a&amp;amp;quot;, {&amp;amp;quot;index&amp;amp;quot;: &amp;amp;quot;1&amp;amp;quot;, &amp;amp;quot;value&amp;amp;quot;: &amp;amp;quot;0.11&amp;amp;quot;, &amp;amp;quot;instance&amp;amp;quot;: &amp;amp;quot;c6585a&amp;amp;quot;})

# Stateful operator for anomaly detection
class AnomalyDetector(anomaly.HalfSpaceTrees):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our anomaly detector inherits from the &lt;code&gt;HalfSpaceTrees&lt;/code&gt; object from the River package and accepts the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;n_trees&lt;/code&gt; – defaults to 10&lt;/li&gt;
&lt;li&gt;&lt;code&gt;height&lt;/code&gt; – defaults to 8&lt;/li&gt;
&lt;li&gt;&lt;code&gt;window_size&lt;/code&gt; – defaults to 250&lt;/li&gt;
&lt;li&gt;&lt;code&gt;limits&lt;/code&gt; (Dict[Hashable, Tuple[float, float]]) – defaults to None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;seed&lt;/code&gt; (int) – defaults to None&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
  def __init__(self, *args, **kwargs):
      super().__init__(*args, n_trees=5, height=3, window_size=5, seed=42, **kwargs)

  def update(self, data):
      self.learn_one({&amp;amp;quot;value&amp;amp;quot;: data[&amp;amp;quot;value&amp;amp;quot;]})
      data[&amp;amp;quot;score&amp;amp;quot;] = self.score_one({&amp;amp;quot;value&amp;amp;quot;: data[&amp;amp;quot;value&amp;amp;quot;]})
      if data[&amp;amp;quot;score&amp;amp;quot;] &amp;amp;gt; 0.7:
          data[&amp;amp;quot;anom&amp;amp;quot;] = 1
      else:
          data[&amp;amp;quot;anom&amp;amp;quot;] = 0
      return self, (
          data[&amp;amp;quot;index&amp;amp;quot;],
          data[&amp;amp;quot;timestamp&amp;amp;quot;],
          data[&amp;amp;quot;value&amp;amp;quot;],
          data[&amp;amp;quot;score&amp;amp;quot;],
          data[&amp;amp;quot;anom&amp;amp;quot;],
      )

flow.stateful_map(&amp;amp;quot;detector&amp;amp;quot;, lambda: AnomalyDetector(), AnomalyDetector.update)
# ((&amp;amp;quot;c6585a&amp;amp;quot;, {&amp;amp;quot;index&amp;amp;quot;: &amp;amp;quot;1&amp;amp;quot;, &amp;amp;quot;value&amp;amp;quot;:0.08, &amp;amp;quot;instance&amp;amp;quot;: &amp;amp;quot;fe7f93&amp;amp;quot;, &amp;amp;quot;score&amp;amp;quot;:0.02}))

# filter out non-anomalous values
flow.filter(lambda x: bool(x[1][4]))

flow.map(lambda x: (x[0], json.dumps(x[1])))
flow.output(&amp;amp;quot;output&amp;amp;quot;, KafkaOutput(kafka_servers.split(&amp;amp;quot;,&amp;amp;quot;), kafka_output_topic))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this dataflow, we first read data from Kafka and deserialize the JSON message. We then normalize the CPU usage values and group them by the instance ID. Next, we apply the AnomalyDetector class inside a stateful operator, which calculates the anomaly score for each data point using HST. We set a threshold for the anomaly score (0.7 in this example) and mark data points as anomalous if their scores exceed the threshold. Finally, we filter out non-anomalous values and output the anomalous data points to a separate Kafka topic.&lt;/p&gt;
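&lt;p&gt;Stripped of the Bytewax plumbing, the per-message logic amounts to a small chain of pure functions. The sketch below pushes one fabricated CloudWatch-style payload through the same steps (deserialize, normalize, score, threshold, filter); the &lt;code&gt;score_one&lt;/code&gt; stand-in and the sample message are illustrative assumptions, not River's API or real Logstash output:&lt;/p&gt;

```python
import json

THRESHOLD = 0.7  # same cutoff used in the dataflow above

def normalize(raw):
    """Deserialize the JSON payload and scale the CPU value to [0, 1]."""
    data = json.loads(raw)
    data["value"] = float(data["value"]) / 100
    return data["instance"], data

def score_one(value):
    """Stand-in for the HST model: flag values far from a 'typical' 20% load."""
    return min(abs(value - 0.2) / 0.8, 1.0)

def detect(key, data):
    """Score the reading and mark it anomalous if it exceeds the threshold."""
    data["score"] = score_one(data["value"])
    data["anom"] = 1 if data["score"] > THRESHOLD else 0
    return key, data

# A fabricated Logstash-style message: 98% CPU on one instance.
raw = json.dumps({"index": "1", "value": "98.0", "instance": "c6585a"})

key, data = detect(*normalize(raw))
if data["anom"]:  # the dataflow's filter step
    print(key, data["score"])  # anomalous readings pass through
```

&lt;p&gt;In the real dataflow the model state is kept per instance ID, so each EC2 instance gets its own detector; that is what &lt;code&gt;stateful_map&lt;/code&gt; keyed on the instance provides.&lt;/p&gt;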

&lt;p&gt;Using this dataflow, we can continuously monitor EC2 metrics and detect anomalous CPU usage, helping us identify potential issues in our infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Dataflow Docker Image
&lt;/h2&gt;

&lt;p&gt;Create a Dockerfile named &lt;code&gt;dataflow-Dockerfile&lt;/code&gt; that installs the Python dependencies on top of the Bytewax base image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM bytewax/bytewax:0.16.0-python3.9
RUN /venv/bin/pip install river==0.10.1 pandas confluent-kafka


docker build -f dataflow-Dockerfile -t $REPOSITORY_URI:\dataflow . 


docker push $REPOSITORY_URI:\dataflow

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploying the Dataflow
&lt;/h2&gt;

&lt;p&gt;To deploy the dataflow, we'll use the Bytewax command-line tool, waxctl. There are two options for deploying the dataflow, depending on how you have set up your Kafka server environment variable. When we deploy our dataflow, we will set the number of processes (the &lt;code&gt;-p&lt;/code&gt; flag) to 5 to match the number of partitions we set when we initially created our Redpanda topic.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: Generate waxctl command
&lt;/h4&gt;

&lt;p&gt;Use the following command to generate the waxctl command with the appropriate environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo&amp;amp;quot;
waxctl df deploy ./dataflow.py \\
  --name ec2-cpu-ad \\
  -p 5 \\
  -i $REPOSITORY_URI \\
  -t dataflow \\
  -e &amp;amp;apos;\&amp;amp;quot;BYTEWAX_KAFKA_SERVER=$BROKERS\&amp;amp;quot;&amp;amp;apos; \\
  -e BYTEWAX_KAFKA_TOPIC_GROUP_ID=dataflow_group \\
  --debug
&amp;amp;quot;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will output the waxctl command with the correct Kafka server values. Copy the output and run it to deploy the dataflow.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 2: Hardcoded BYTEWAX_KAFKA_SERVER value
&lt;/h4&gt;

&lt;p&gt;If you prefer to hardcode the Kafka server values, use the following command to deploy the dataflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;waxctl df deploy ./dataflow.py \
  --name ec2-cpu-ad \
  -p 5 \
  -i $REPOSITORY_URI \
  -t dataflow \
  -e &amp;amp;apos;&amp;amp;quot;BYTEWAX_KAFKA_SERVER=three-node-cluster-0.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-1.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-2.three-node-cluster.redpanda-bytewax.svc.cluster.local.&amp;amp;quot;&amp;amp;apos; \
  -e BYTEWAX_KAFKA_TOPIC_GROUP_ID=dataflow_group \
  --debug

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the dataflow deployed and given some time to process data, you can consume from the &lt;code&gt;ec2_metrics_anomalies&lt;/code&gt; topic to see any detected anomalies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic consume ec2_metrics_anomalies -o start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a next step, you could deploy a dataflow that consumes from the anomalies topic and alerts you in Slack! Or add &lt;a href="https://github.com/rerun-io/rerun"&gt;rerun&lt;/a&gt;, as we demonstrated in the previous blog post, to visualize the anomalies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we have demonstrated how to set up a system for monitoring EC2 metrics and detecting anomalous CPU usage. By leveraging tools like Logstash, &lt;a href="https://redpanda.com/"&gt;Redpanda&lt;/a&gt;, &lt;a href="https://riverml.xyz/0.15.0/"&gt;River&lt;/a&gt;, and Bytewax, we've created a robust and scalable pipeline for processing and analyzing streaming data.&lt;/p&gt;

&lt;p&gt;This system provides a range of benefits, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Efficiently processing high-dimensional and high-velocity data streams&lt;/li&gt;
&lt;li&gt;Using the Half Space Trees unsupervised machine learning algorithm for detecting anomalies in streaming data&lt;/li&gt;
&lt;li&gt;Continuously monitoring EC2 metrics and identifying potential issues in the infrastructure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this setup, you can effectively monitor your EC2 instances and ensure that your infrastructure is running smoothly, helping you proactively address any issues that may arise.&lt;/p&gt;

&lt;p&gt;That's it! You now have a working cloud-based anomaly detection system using &lt;a href="https://bytewax.io/"&gt;Bytewax&lt;/a&gt;, &lt;a href="https://redpanda.com/"&gt;Redpanda&lt;/a&gt;, and AWS. Feel free to adapt this setup to your specific use case and explore the various features and capabilities offered by these tools.&lt;/p&gt;

</description>
      <category>anomalydetection</category>
      <category>aws</category>
      <category>redpanda</category>
    </item>
    <item>
      <title>Real-Time Anomaly Detection Visualization with Bytewax and Rerun</title>
      <dc:creator>Zander</dc:creator>
      <pubDate>Thu, 13 Apr 2023 22:39:53 +0000</pubDate>
      <link>https://dev.to/bytewax/real-time-anomaly-detection-visualization-with-bytewax-and-rerun-1574</link>
      <guid>https://dev.to/bytewax/real-time-anomaly-detection-visualization-with-bytewax-and-rerun-1574</guid>
      <description>&lt;p&gt;&lt;a href="https://www.rerun.io/"&gt;Rerun's&lt;/a&gt; open sourcing in February marked a significant step for those looking for accessible yet potent Python visualization libraries. Why is visualization important? Visualization is essential since companies like Scale.ai, Weights &amp;amp; Biases, and Hugging Face have streamlined deep learning by addressing dataset labeling, experiment tracking, and pre-trained models. However, a void still exists in rapid data capture and visualization.&lt;/p&gt;

&lt;p&gt;Many companies develop in-house data visualization solutions but often end up with suboptimal tools due to high development costs. Moreover, Python visualization on &lt;em&gt;streaming&lt;/em&gt; data is a problem that is not solved well either, leading to &lt;a href="https://bytewax.io/blog/visualize-streaming-data-in-python"&gt;JavaScript based solutions in notebooks&lt;/a&gt;. Rerun leverages a Python interface into a high-performant Rust visualization engine (much like Bytewax!) that makes it dead easy to analyze streaming data.&lt;/p&gt;

&lt;p&gt;In this blog post, we will explore how to use Bytewax and Rerun to visualize real-time streaming data in Python and create a real-time anomaly detection visualization. We chose anomaly detection, a.k.a. outlier detection, because it is a critical component in numerous applications, such as cybersecurity, fraud detection, and monitoring of industrial processes. Visualizing these anomalies in real time can aid in quickly identifying potential issues and taking necessary actions to mitigate them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For those eager to dive in, check out our end-to-end Python solution on our &lt;a href="https://github.com/bytewax/visualizing-anomalies/blob/main/dataflow.py"&gt;GitHub&lt;/a&gt;. Don't forget to star Bytewax!&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Here is what we'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We will navigate the code and briefly discuss top-level entities&lt;/li&gt;
&lt;li&gt;Then we will discuss each step of the dataflow in greater detail: initialization of our dataflow, input source, stateful anomaly detection, data visualization &amp;amp; output, and how to spawn a cluster&lt;/li&gt;
&lt;li&gt;Finally, we will learn how to run it and see the beautiful visualization, all in Python &amp;lt;3&lt;/li&gt;
&lt;li&gt;As a bonus, we will think about other use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup your environment
&lt;/h2&gt;

&lt;p&gt;This blog post is based on the following versions of Bytewax and Rerun:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bytewax==0.15.1
rerun-sdk==0.4.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rerun and Bytewax can be installed with pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install rerun-sdk
pip install bytewax

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Follow Bytewax for updates as we are baking a new version that will ease the development of data streaming apps in Python further.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The solution is relatively compact, so we copy the entire code example here. Please feel free to skip this big chunk if it looks overwhelming; we will discuss each function later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;random&lt;/span&gt;
&lt;span class="c1"&gt;# pip install rerun-sdk
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;rerun&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rr&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sleep&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bytewax.dataflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bytewax.execution&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spawn_cluster&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bytewax.inputs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManualInputConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distribute&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bytewax.outputs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManualOutputConfig&lt;/span&gt;

&lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;)&lt;/span&gt;
&lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;input_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resume_state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;resume_state&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;]&lt;/span&gt;
    &lt;span class="n"&gt;this_workers_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
            &lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ZTestDetector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;Anomaly&lt;/span&gt; &lt;span class="n"&gt;detector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

    &lt;span class="n"&gt;Use&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stateful_map&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;

    &lt;span class="n"&gt;Looks&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt; &lt;span class="n"&gt;many&lt;/span&gt; &lt;span class="n"&gt;standard&lt;/span&gt; &lt;span class="n"&gt;deviations&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;away&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Mark&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;anomalous&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt;
    &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="n"&gt;specified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold_z&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold_z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold_z&lt;/span&gt;

        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_10&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_10&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_10&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_recalc_stats&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;last_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;last_len&lt;/span&gt;
        &lt;span class="n"&gt;sigma_sq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;last_len&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sigma_sq&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="n"&gt;__value__&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="n"&gt;__value__&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
        &lt;span class="n"&gt;is_anomalous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;is_anomalous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threshold_z&lt;/span&gt;

        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_recalc_stats&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;temp_&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_anomalous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;dpoint&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;temp_&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt;
                &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;scattered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;dpoint&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_anomalous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;output_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inspector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_anomalous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;{&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;{&lt;/span&gt;&lt;span class="n"&gt;is_anomalous&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inspector&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;apos&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;__main__&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;apos&lt;/span&gt;&lt;span class="p"&gt;;:&lt;/span&gt;
    &lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="n"&gt;ManualInputConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_builder&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# (&amp;amp;quot;metric&amp;amp;quot;, value)
&lt;/span&gt;    &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stateful_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;AnomalyDetector&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ZTestDetector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ZTestDetector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# (&amp;amp;quot;metric&amp;amp;quot;, (value, mu, sigma, is_anomalous))
&lt;/span&gt;    &lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ManualOutputConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_builder&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;spawn_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The provided code demonstrates how to create a real-time anomaly detection pipeline using Bytewax and Rerun. Let's break down the essential components of this code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;input_builder&lt;/strong&gt; : This function simulates real-world data streams by generating random metric values; each data point has a 10% chance of being doubled to mimic an anomaly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ZTestDetector&lt;/strong&gt; : This class implements an anomaly detector using the Z-score method. It maintains the mean and standard deviation of the last 10 values and marks a value as anomalous if its Z-score is greater than a specified threshold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;output_builder&lt;/strong&gt; : This function is used to define the output behavior for the data pipeline. In this case, it prints the metric name, value, mean, standard deviation, and whether the value is anomalous.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataflow&lt;/strong&gt; : The main part of the code constructs the dataflow using Bytewax, connecting the input builder, the ZTestDetector, and the output builder.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rerun visualization&lt;/strong&gt; : The Rerun visualization is integrated into the ZTestDetector class. The rr.log_scalar and rr.log_point functions are used to plot the data points and their corresponding anomaly status.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, with an understanding of the code's main components, let's discuss how the visualization is created step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Dataflow
&lt;/h2&gt;

&lt;p&gt;To create a dataflow pipeline, you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initialize a new dataflow with &lt;code&gt;flow = Dataflow()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Define the input source using &lt;code&gt;flow.input(&amp;amp;quot;input&amp;amp;quot;, ManualInputConfig(input_builder))&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Apply the stateful anomaly detector using &lt;code&gt;flow.stateful_map(&amp;amp;quot;AnomalyDetector&amp;amp;quot;, lambda: ZTestDetector(2.0), ZTestDetector.push)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Configure the output behavior with &lt;code&gt;flow.capture(ManualOutputConfig(output_builder))&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Finally, spawn a cluster to execute the dataflow with &lt;code&gt;spawn_cluster(flow)&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The resulting dataflow reads the randomly generated metric values from &lt;code&gt;input_builder&lt;/code&gt;, passes them through the &lt;code&gt;ZTestDetector&lt;/code&gt; for anomaly detection, and outputs the results using the &lt;code&gt;output_builder&lt;/code&gt; function. Let's clarify the details for each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;input_builder&lt;/code&gt; function
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;input_builder&lt;/code&gt; function serves as the input source for the dataflow pipeline, generating random metric values distributed across multiple workers. It accepts three parameters: &lt;code&gt;worker_index&lt;/code&gt;, &lt;code&gt;worker_count&lt;/code&gt;, and &lt;code&gt;resume_state&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;input_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resume_state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;resume_state&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;]&lt;/span&gt;
    &lt;span class="n"&gt;this_workers_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;randrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
            &lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;worker_index&lt;/code&gt;: The index of the current worker in the dataflow pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;worker_count&lt;/code&gt;: The total number of workers in the dataflow pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resume_state&lt;/code&gt;: The state of the input source from which to resume. In this case, it is asserted to be &lt;code&gt;None&lt;/code&gt;, indicating that the input source does not support resuming from a previous state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a step-by-step description of the &lt;code&gt;input_builder&lt;/code&gt; function:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Assert that &lt;code&gt;resume_state&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Define a list of keys representing the metrics.&lt;/li&gt;
&lt;li&gt;Distribute the keys among the workers using the &lt;code&gt;distribute&lt;/code&gt; function (not shown in the code snippet). The keys assigned to the current worker are stored in &lt;code&gt;this_workers_keys&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Iterate 1,000 times and, for each iteration, iterate through the list of keys:

&lt;ul&gt;
&lt;li&gt;Generate a random integer value between 0 and 9.&lt;/li&gt;
&lt;li&gt;With a 10% probability, double the value to simulate an anomaly.&lt;/li&gt;
&lt;li&gt;Yield a tuple containing None (to indicate no specific partition key), the key, the generated value, and the elapsed time since the starting time (not provided in the code snippet).&lt;/li&gt;
&lt;li&gt;Introduce a sleep time between each generated value to simulate real-time data generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
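&lt;p&gt;The &lt;code&gt;distribute&lt;/code&gt; helper itself is not shown in the article; a minimal round-robin sketch of its behavior (an illustration, not Bytewax's actual implementation) could look like this:&lt;/p&gt;

```python
def distribute_sketch(items, worker_index, worker_count):
    """Round-robin partitioning sketch (hypothetical stand-in for
    Bytewax's distribute): worker i takes every worker_count-th item,
    starting at offset i."""
    for position, item in enumerate(items):
        if position % worker_count == worker_index:
            yield item

keys = ["1", "2", "3", "4", "5", "6"]
# With 3 workers, worker 0 would receive keys "1" and "4".
worker_0_keys = list(distribute_sketch(keys, 0, 3))
```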

&lt;p&gt;The &lt;code&gt;input_builder&lt;/code&gt; function is used in the dataflow as the input source with the following line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="n"&gt;ManualInputConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_builder&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line tells the dataflow to use the &lt;code&gt;input_builder&lt;/code&gt; function to generate the input data for the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;ZTestDetector&lt;/code&gt; Class
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ZTestDetector&lt;/code&gt; class is an anomaly detector that uses the Z-score method to identify whether a data point is anomalous or not. The Z-score is the number of standard deviations a data point is from the mean of a dataset. If a data point's Z-score is higher than a specified threshold, it is considered anomalous.&lt;/p&gt;
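&lt;p&gt;The decision rule itself is a one-liner; as a standalone illustration (a sketch, not part of the pipeline code):&lt;/p&gt;

```python
def exceeds_threshold(value, mu, sigma, threshold_z):
    # A value is flagged when it lies more than threshold_z
    # standard deviations away from the rolling mean.
    return abs(value - mu) / sigma > threshold_z

# With mu=5.0 and sigma=2.5, a reading of 18.0 has a Z-score of
# |18.0 - 5.0| / 2.5 = 5.2, well past a threshold of 2.0.
flagged = exceeds_threshold(18.0, 5.0, 2.5, 2.0)
```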

&lt;p&gt;The class has the following methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__init__(self, threshold_z)&lt;/code&gt;: The constructor initializes the ZTestDetector with a threshold Z-score value. It also initializes the list of the last 10 values (&lt;code&gt;self.last_10&lt;/code&gt;), the mean (&lt;code&gt;self.mu&lt;/code&gt;), and the standard deviation (&lt;code&gt;self.sigma&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_push(self, value)&lt;/code&gt;: This private method is used to update the list of last 10 values with the new value. It inserts the new value at the beginning of the list and removes the oldest value, maintaining the list length at 10.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_recalc_stats(self)&lt;/code&gt;: This private method recalculates the mean and standard deviation based on the current values in the self.last_10 list.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;push(self, key__value__t)&lt;/code&gt;: This public method takes a tuple containing a key, a value, and a timestamp as input. It calculates the Z-score for the value, updates the last 10 values list, and recalculates the mean and standard deviation. It also logs the data point and its anomaly status using Rerun's visualization functions. Finally, it returns the updated instance of the ZTestDetector class and a tuple containing the value, mean, standard deviation, and anomaly status.&lt;/li&gt;
&lt;/ul&gt;
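&lt;p&gt;The &lt;code&gt;_push&lt;/code&gt;/&lt;code&gt;_recalc_stats&lt;/code&gt; pair maintains a fixed-size sliding window by hand. The same bookkeeping can be sketched with &lt;code&gt;collections.deque(maxlen=10)&lt;/code&gt;, which evicts the oldest value automatically (an alternative sketch, not the article's code):&lt;/p&gt;

```python
from collections import deque

class SlidingStats:
    """Rolling mean and standard deviation over the last 10 values
    (a sketch equivalent to _push + _recalc_stats)."""

    def __init__(self, maxlen=10):
        self.window = deque(maxlen=maxlen)  # oldest value dropped automatically

    def push(self, value):
        self.window.append(value)
        n = len(self.window)
        mu = sum(self.window) / n
        sigma = (sum((v - mu) ** 2 for v in self.window) / n) ** 0.5
        return mu, sigma

stats = SlidingStats()
for v in range(15):
    mu, sigma = stats.push(v)
# After 15 pushes the window holds only the values 5..14.
```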

&lt;p&gt;The ZTestDetector class is used in the dataflow pipeline as a stateful map with the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stateful_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;AnomalyDetector&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ZTestDetector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ZTestDetector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;push&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line tells the dataflow to apply the &lt;code&gt;ZTestDetector&lt;/code&gt; with a Z-score threshold of &lt;code&gt;2.0&lt;/code&gt; and use the &lt;code&gt;push&lt;/code&gt; method to process the data points.&lt;/p&gt;
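&lt;p&gt;The contract behind &lt;code&gt;stateful_map&lt;/code&gt; is that one state object is kept per key: the builder creates it the first time a key is seen, and the mapper receives the current state plus the new value and returns the updated state along with an output. The following pure-Python loop simulates that contract; it is only an illustration of the semantics, not the Bytewax implementation:&lt;/p&gt;

```python
def simulate_stateful_map(events, builder, mapper):
    """Toy, single-threaded simulation of stateful_map semantics:
    one state per key, created on first sight, updated per event."""
    states = {}
    out = []
    for key, value in events:
        if key not in states:
            states[key] = builder()
        # The mapper returns (updated_state, output), like ZTestDetector.push.
        states[key], result = mapper(states[key], value)
        out.append((key, result))
    return out

# Example with a simple running-sum state per key:
totals = simulate_stateful_map(
    [("a", 1), ("b", 5), ("a", 2)],
    builder=lambda: 0,
    mapper=lambda state, v: (state + v, state + v),
)
# totals == [("a", 1), ("b", 5), ("a", 3)]
```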

&lt;h3&gt;
  
  
  Visualizing Anomalies
&lt;/h3&gt;

&lt;p&gt;To visualize the anomalies, the &lt;code&gt;ZTestDetector&lt;/code&gt; class logs each data point and its anomaly status using Rerun's visualization functions. Specifically, &lt;code&gt;rr.log_scalar&lt;/code&gt; plots a scalar time series, while &lt;code&gt;rr.log_point&lt;/code&gt; plots points in 3D space.&lt;/p&gt;

&lt;p&gt;The following code snippet shows how the visualization is created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;temp_&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_anomalous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;dpoint&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;temp_&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;anomaly&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;scattered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;dpoint&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we first log a scalar value representing the metric. Then, depending on whether the value is anomalous, we log a 3D point with a different radius and color. Anomalous points are logged in red with a larger radius, while non-anomalous points are logged with a smaller radius.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;output_builder&lt;/code&gt; Function
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;output_builder&lt;/code&gt; function is used to define the output behavior for the data pipeline. In this specific example, it is responsible for printing the metric name, value, mean, standard deviation, and whether the value is anomalous. The function takes two arguments: &lt;code&gt;worker_index&lt;/code&gt; and &lt;code&gt;worker_count&lt;/code&gt;. These arguments help the function understand the index of the worker and the total number of workers in the dataflow pipeline.&lt;/p&gt;

&lt;p&gt;Here's the definition of the &lt;code&gt;output_builder&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;output_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inspector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_anomalous&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;
        &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;{&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="p"&gt;:.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;{&lt;/span&gt;&lt;span class="n"&gt;is_anomalous&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;quot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inspector&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a higher-order function: it returns another function, &lt;code&gt;inspector&lt;/code&gt;, which processes each input tuple and prints the desired output. The &lt;code&gt;output_builder&lt;/code&gt; function is later used in the dataflow pipeline when configuring the output behavior with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ManualOutputConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_builder&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the Dataflow
&lt;/h2&gt;

&lt;p&gt;Bytewax can run as a single process or across multiple processes. This dataflow has been created to scale across multiple processes, but we will start off running it as a single process with the &lt;code&gt;spawn_cluster&lt;/code&gt; execution module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spawn_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we wanted to increase the parallelism, we would simply pass a process count as an argument, for example &lt;code&gt;spawn_cluster(flow, proc_count=3)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The provided code can be run as a plain Python script, but first we need to install the dependencies.&lt;/p&gt;

&lt;p&gt;Create a new file named &lt;code&gt;requirements.txt&lt;/code&gt; in the same directory as &lt;code&gt;dataflow.py&lt;/code&gt; and add the following content to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bytewax==0.15.1
rerun-sdk==0.4.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open a terminal in the directory containing the &lt;code&gt;requirements.txt&lt;/code&gt; and &lt;code&gt;dataflow.py&lt;/code&gt; files.&lt;/p&gt;

&lt;p&gt;Install the dependencies using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run the dataflow!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python dataflow.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Expanding the Use Case
&lt;/h2&gt;

&lt;p&gt;While the provided code serves as a basic example of real-time anomaly detection, you can expand this pipeline to accommodate more complex scenarios. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Incorporate real-world data sources&lt;/strong&gt;: Replace the &lt;code&gt;RandomMetricInput&lt;/code&gt; class with a custom class that reads data from a real-world source, such as IoT sensors, log files, or streaming APIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement more sophisticated anomaly detection techniques&lt;/strong&gt;: You can replace the &lt;code&gt;ZTestDetector&lt;/code&gt; class with other stateful anomaly detection methods, such as moving averages, exponential smoothing, or machine learning-based approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customize the visualization&lt;/strong&gt;: Enhance the Rerun visualization by adding more data dimensions, adjusting the color schemes, or modifying the plot styles to better suit your needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrate with alerting and monitoring systems&lt;/strong&gt;: Instead of simply printing the anomaly results, you can integrate the pipeline with alerting or monitoring systems to notify the appropriate stakeholders when an anomaly is detected.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
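&lt;p&gt;As a concrete example of the first point, a replacement input could read (key, value, timestamp) tuples from a JSON-lines log file instead of generating random metrics. This is a hypothetical sketch: the field names (&lt;code&gt;sensor_id&lt;/code&gt;, &lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;ts&lt;/code&gt;) are assumptions, and wiring the generator into the dataflow's input configuration is left out.&lt;/p&gt;

```python
import json

def file_metric_input(path):
    """Yield (key, value, timestamp) tuples from a JSON-lines file,
    as a drop-in data source for the anomaly-detection dataflow.
    Field names here are hypothetical."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            yield record["sensor_id"], record["temperature"], record["ts"]
```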

&lt;p&gt;By customizing and extending the dataflow pipeline, you can create a powerful real-time anomaly detection and visualization solution tailored to your specific use case. The combination of Bytewax and Rerun offers a versatile and scalable foundation for building real-time data processing and visualization systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This blog post has demonstrated how to use Bytewax and Rerun to create a real-time anomaly detection visualization. By building a dataflow pipeline with Bytewax and integrating Rerun's powerful visualization capabilities, we can monitor and identify anomalies in our data as they occur.&lt;/p&gt;

</description>
      <category>visualization</category>
      <category>machinelearning</category>
      <category>datastreaming</category>
    </item>
    <item>
      <title>Data Council: The Highlights of Day 2</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Sun, 26 Mar 2023 00:45:10 +0000</pubDate>
      <link>https://dev.to/bytewax/data-council-the-highlights-of-day-2-183e</link>
      <guid>https://dev.to/bytewax/data-council-the-highlights-of-day-2-183e</guid>
      <description>&lt;p&gt;Welcome back, data enthusiasts! I'm excited to dive into the second installment of my blog series covering the extraordinary Data Council Conference. If you haven't already, be sure to check out &lt;a href="https://dev.to/bytewax/data-council-the-highlights-of-day-1-493h"&gt;my first post&lt;/a&gt;, which provided a comprehensive overview of the engaging talks and workshops from Day 1.&lt;/p&gt;

&lt;p&gt;On Day 2, before the sessions, we are organizing an informal #StreamBrew coffee gathering for early birds at 7:15 am at KesosTacos near the conference venue. RSVP &lt;a href="https://bitly.com/m/bytewax"&gt;here&lt;/a&gt;. I hope to mingle, network, and enjoy some scrumptious breakfast migas alongside morning coffee. If you've never had migas, don't worry - I haven't either - you won't be experimenting alone!&lt;/p&gt;

&lt;h2&gt;
  
  
  Panels
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI Panel
&lt;/h3&gt;

&lt;p&gt;One of the most highly anticipated events on Day 2 of the Data Council Conference is the AI Panel. Though details about the panel's specific focus remain under wraps, the excitement is palpable. I expect a riveting discussion featuring top-tier experts, who will undoubtedly share their unique perspectives on the current state and future directions of artificial intelligence. AI is changing the world we live in, and it seems to happen faster every week!&lt;/p&gt;

&lt;h3&gt;
  
  
  How Investors Think About Data
&lt;/h3&gt;

&lt;p&gt;Another must-attend event on Day 2 is the panel titled "How Investors Think About Data," featuring an impressive lineup of investment professionals. Gain valuable insights from Lauren Reeder, Partner at Sequoia Capital; Slater Stich, Partner at Bain Capital Ventures; Leigh Marie Braswell, Principal at Founders Fund; and Pete Soderling, Founder of Data Community Fund.&lt;/p&gt;

&lt;p&gt;I work for a data-oriented startup. And given the current state of the economy, including the infamous SVB disaster, I am curious about what fundraising will look like in the mid-to-long term and how to maximize our chances of success. Also, Pete is the founder and chair of the Data Council conference, and I am eager to hear from him too!&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks
&lt;/h2&gt;

&lt;p&gt;Day 2 of the Data Council Conference offers three tracks; the full schedule is &lt;a href="https://docs.google.com/document/d/1T3dtBXeEyrujeg-5H8L5ncWKGuq3vMFYMyjriXWMAAI/edit"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first track, "Applied &amp;amp; Generative AI," covers topics such as Large Language/Transformer Models, generative AI, product-based implementations of new research methods, and exciting new features powered by machine learning inside products.&lt;/p&gt;

&lt;p&gt;The second track, "Analytics," focuses on the latest tools, techniques, and best practices for extracting valuable insights from data. You'll learn how top teams are solving their analytics challenges and discover the best new tools in the process.&lt;/p&gt;

&lt;p&gt;Finally, my favorite one, the "Data Culture &amp;amp; Community" track. It emphasizes fostering a vibrant data ecosystem and promoting collaboration among data professionals. Sessions in this track will highlight the role of community building, open-source projects, and knowledge sharing in advancing data science and data engineering. &lt;/p&gt;

&lt;p&gt;In case you're torn between multiple sessions like me, remember that many of the presentations will be recorded and made available for viewing later. With that in mind, I will highlight only a fraction of what sparks my interest.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/generative-ai-for-product-builders?hsLang=en"&gt;Tristan Zajonc - Generative AI for Product Builders &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;I always considered no-code or low-code solutions an excellent option for a non-technical (and technical, too, in some cases) founder to build a prototype and get their MVP out there as soon as possible without hiring a bunch of developers. DALL•E, MidJourney, and Stable Diffusion did a similar thing and unlocked creativity for the rest of us. In that light, Tristan's talk about the caveats and nuances of building products using generative AI is very well-timed and relevant. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/how-vercel-builds-dozens-of-metrics-from-one-heterogenous-table?hsLang=en"&gt;Thomas Mickley-Doyle "How Vercel Builds Dozens of Metrics from One Heterogenous Table"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;I remember quite a few blog posts about the importance of reacting quickly to changes, partly because Bytewax enables real-time ML and partly because it's a hot topic. Thomas Mickley-Doyle from Vercel will also share their innovative approach to data-driven decision-making. Vercel's strategy has increased stakeholder participation in analytics, reduced troubleshooting time for outlier events, and eliminated the data team as a bottleneck for data-related tasks. Sounds like a lot of fun!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/behind-the-curtain-what-it-takes-to-support-the-worlds-most-popular-open-source-communities?hsLang=en"&gt;Katrina Riehl "Behind the Curtain: What it Takes to Support the World's Most Popular Open Source Communities"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Dr. Katrina Riehl is President of the Board of Directors at NumFOCUS, Head of the Streamlit Data Team at Snowflake, and Adjunct Lecturer at Georgetown University. If you are building an OSS-driven business or care about how the community perceives your brand (and you'd better :)), her talk is a must-go. NumFOCUS operates at a vast scale: 50 sponsored projects and 60 affiliated projects, including some of the world's most popular open-source projects like NumPy, SciPy, Jupyter, and pandas. There is definitely a ton to learn from NumFOCUS and Katrina.&lt;/p&gt;

&lt;p&gt;I can't wait to share more of the content from the conference itself! I expect no less than an unforgettable experience!&lt;/p&gt;

</description>
      <category>conference</category>
      <category>realtimeanalytics</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Council: The Highlights of Day 1</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Thu, 23 Mar 2023 05:44:54 +0000</pubDate>
      <link>https://dev.to/bytewax/data-council-the-highlights-of-day-1-493h</link>
      <guid>https://dev.to/bytewax/data-council-the-highlights-of-day-1-493h</guid>
      <description>&lt;p&gt;The COVID-19 pandemic has profoundly impacted how we work and learn, and the conference industry is no exception. Many events have moved to virtual formats, allowing attendees to participate from the comfort of their own homes. I even built a business around it! And while I absolutely love virtual events and can talk about their advantages endlessly, there's an undeniable charm to in-person conferences, too.&lt;/p&gt;

&lt;p&gt;After &lt;em&gt;three years&lt;/em&gt; of remote work, I am thrilled to finally attend &lt;a href="https://www.datacouncil.ai/"&gt;the Data Council conference&lt;/a&gt; in person in Austin and connect with fellow tech enthusiasts face-to-face as soon as next week!&lt;/p&gt;

&lt;p&gt;The conference attracts diverse data professionals from various industries. Whilst I've been to events that featured data talks or data tracks, and even organized a virtual data-focused conference myself, this is the first time I'll have the chance to see so many professionals interested in the latest developments in data engineering, data science, machine learning, and AI.&lt;/p&gt;

&lt;p&gt;Come say hi 👋 I'm also bringing &lt;a href="https://bytewax.io/"&gt;Bytewax&lt;/a&gt; swag that you don't want to miss, so &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;let's keep in touch&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Today I want to share some of the sessions that I found particularly exciting and would like to attend.&lt;/p&gt;

&lt;p&gt;I have to split this post because it's too much to cover in one shot; you are reading about Day 1, March 28th.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://docs.google.com/document/d/1T3dtBXeEyrujeg-5H8L5ncWKGuq3vMFYMyjriXWMAAI/edit"&gt;Agenda&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The conference features an action-packed schedule across three days, including regular and lightning talks, workshops, and even speaker office hours.  The latter is especially helpful for newcomers to the community (like me), facilitating connections with experts.&lt;/p&gt;

&lt;p&gt;Beyond the formal sessions, the conference also offers plenty of opportunities for informal networking (see &lt;a href="https://twitter.com/DataCouncilAI/status/1630994017679802371?s=20"&gt;this thread&lt;/a&gt;). We (Bytewax) are organizing &lt;a href="https://bit.ly/3YszNvd?r=lp"&gt;#StreamBrew coffee&lt;/a&gt; on March 29th in the morning (7:15 AM) and &lt;a href="https://bit.ly/3ZGzRsw?r=lp"&gt;#StreamBrew Beer&lt;/a&gt; in the evening on March 30th.&lt;/p&gt;

&lt;p&gt;No wonder that, with so much to offer, this conference is a must-attend event for data folks!&lt;/p&gt;

&lt;h2&gt;
  
  
  Keynotes
&lt;/h2&gt;

&lt;p&gt;As I said before, the conference's schedule is crowded, and keynotes are no exception: two on each day!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/building-a-control-plane-for-data?hsLang=en"&gt;Shirshanka Das "Building a Control Plane for Data"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The conference kicks off with an exciting keynote by &lt;a href="https://www.linkedin.com/in/shirshankadas/"&gt;Shirshanka Das&lt;/a&gt;. Shirshanka is a co-founder and CEO of Acryl Data. He will discuss the control plane for data, a harmonizing layer powered by metadata that unifies data discovery, observability, quality, governance, and management. He will describe the fundamental characteristics of a control plane and explain the use cases that can be accomplished with a unified control plane.&lt;/p&gt;

&lt;p&gt;I am obsessed with unification and simplification. It brings order and enables teams to work more effectively. Thrilled to hear Shirshanka's thoughts on how to do that for data stacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/big-data-is-dead?hsLang=en"&gt;Jordan Tigani "Big Data is Dead" &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Next up is &lt;a href="https://twitter.com/jrdntgn"&gt;Jordan Tigani&lt;/a&gt; of MotherDuck with an intriguing title, "Big Data is Dead." The conference's website didn't have a description of the talk at the time I was writing this, but I googled and found &lt;a href="https://motherduck.com/blog/big-data-is-dead/"&gt;a fresh blog post&lt;/a&gt; by Jordan. &lt;br&gt;
I have to admit, I was a little skeptical about the title as it sounds like clickbait (unrelated, but I have a background in Scala, and Scala is dead forever and dies every year again and again, so it's not news). &lt;/p&gt;

&lt;p&gt;Nonetheless, Jordan is exceptionally qualified to talk about this topic; he shares graphs based on query logs, deal post-mortems, benchmark results, customer support tickets, customer conversations, service logs, and published blog posts. He makes his points well, and I won't post spoilers by citing his blog post. Besides, I am sure he has more to share in his keynote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks
&lt;/h2&gt;

&lt;p&gt;There are three tracks on Day 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Engineering &amp;amp; Infra&lt;/li&gt;
&lt;li&gt;Data Science &amp;amp; Algos&lt;/li&gt;
&lt;li&gt;ML Ops &amp;amp; Platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is challenging to choose what to highlight, and I might overlook or forget some talks, so if your favorite one is not on the list, please feel free to let me know on &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;our Slack&lt;/a&gt;, or tag us on &lt;a href="https://twitter.com/bytewax"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/company/bytewax"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://twitter.com/Oli_kitty"&gt;my DMs&lt;/a&gt; are open too.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/data-contracts-accountable-data-quality?hsLang=en"&gt;Chad Sanderson "Data Contracts: Accountable Data Quality."&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/chad-sanderson/"&gt;Chad Sanderson&lt;/a&gt; is the Founder of Data Quality Camp, and &lt;a href="https://join.slack.com/t/dataqualitycamp/shared_invite/zt-1rk5xsx5j-o3dnRa75iM1mY5~R9HWJMg"&gt;the Data Quality Camp's Slack&lt;/a&gt; is the friendliest place to be. The channels are active, members are helpful, and you can even shamelessly promote whatever you want in the #be-shameless :D&lt;/p&gt;

&lt;p&gt;If you're interested in data contracts, then Chad's talk is definitely worth checking out. He recently &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7044381753561497600/"&gt;posted on his LinkedIn&lt;/a&gt; that it's going to be the most in-depth presentation yet on how they implemented data contracts at scale at Convoy.&lt;/p&gt;

&lt;p&gt;You'll also want to attend Data Quality Camp's first-ever in-person happy hour on Monday the 27th at the Stay Put Brewery near the event venue.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/extinguishing-the-garbage-fire-of-ml-testing?hsLang=en"&gt;Emily Curtin "Extinguishing the Garbage Fire of ML Testing"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The abstract of &lt;a href="https://www.linkedin.com/in/emilymaycurtin"&gt;Emily Curtin&lt;/a&gt;'s (Staff MLOps Engineer at Intuit Mailchimp) talk resonates with me; I also think that testing should be at the heart and mind of people implementing complex systems. Emily focuses on testing in MLOps and Data Science, which I have yet to familiarize myself with, and I look forward to learning about it from her.&lt;/p&gt;

&lt;p&gt;I also adore that she says in her bio that she gets paid to say "it depends" and "well actually."&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/how-to-interpret-and-explain-your-black-box-models?hsLang=en"&gt;Sophia Yang "How to Interpret &amp;amp; Explain Your Black-Box Models?"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/sophiamyang?trk=public_profile_browsemap"&gt;Sophia Yang&lt;/a&gt; is a Senior Data Scientist and a Developer Advocate at Anaconda. She is highly knowledgeable about technology and passionate about data science and Python open-source communities.&lt;br&gt;
I think we share many interests, so I'm not missing her talk in which she covers popular model explanation techniques such as explainable boosting machine, visual analytics, distillation, prototypes, saliency map, counterfactual, feature visualization, LIME, SHAP, interpretML, and TCAV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V53xU38b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qot8zk7yk5b0uke5dwbm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V53xU38b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qot8zk7yk5b0uke5dwbm.jpg" alt="Jules Damji at Data Love" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/huggingface-ray-air-integration-a-python-developers-guide-to-scaling-transformers?hsLang=en"&gt;Jules Damji &amp;amp; Antoni Baum "HuggingFace + Ray AIR Integration: A Python Developer's Guide to Scaling Transformers"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Last but not least, I want to highlight a talk by &lt;a href="https://twitter.com/2twitme"&gt;Jules Damji&lt;/a&gt;, who spoke at one of my events before (check out his handmade avatar from the pre-Midjourney era). Jules and Antoni will talk about Hugging Face Transformers and Ray AIR. It's cutting-edge Machine Learning, and I'm always eager to discover more about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workshops
&lt;/h2&gt;

&lt;p&gt;At Data Council, all workshops are included in the cost of your ticket, so I will try to attend them too.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/urgent-help-these-pets-find-homes-working-across-teams-in-datahub?hsLang=en"&gt;Maggie Hays &amp;amp; Paul Logan "URGENT! Help these Pets Find Homes: Working Across Teams in DataHub"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/maggie-hays/"&gt;Maggie&lt;/a&gt; and Paul's workshop is about Long Tail Companions (a hypothetical pet adoption service). It is in crisis – its data infrastructure has ground to a halt, and they cannot process any adoptions. I care about pets, love fixing failures, and enjoy teamwork. All things combined, it sounds like an excellent session for me.   &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/how-to-make-marketing-fall-in-love-with-data-modeling?hsLang=en"&gt;Erik Edelmann &amp;amp; Meredith Adler "How to Make Marketing Fall In Love with Data Modeling&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Data Modeling applied to marketing is obviously something that I care about. I'm joining &lt;a href="https://www.linkedin.com/in/erik-edelmann-43247358"&gt;Erik&lt;/a&gt; and Meredith for a demo of the campaign they built at Hightouch. They will cover how the team modeled the data, validated the results, and created a reusable process to support future marketing campaigns.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎈Community party
&lt;/h2&gt;

&lt;p&gt;The day wraps up with a Community Party at 5:30 pm (kudos to Databand for supporting it).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RwyiABu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o17mfpl6gr7x1w5oa2or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RwyiABu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o17mfpl6gr7x1w5oa2or.png" alt="Zander Matheson - getting real time" width="880" height="880"&gt;&lt;/a&gt;&lt;br&gt;
Don't forget to attend &lt;a href="https://www.datacouncil.ai/talks/getting-real-time-when-to-move-from-batch-to-streaming-and-how-to-do-it-without-hiring-an-entirely-new-team?hsLang=en"&gt;Zander's talk&lt;/a&gt;; I'll be giving away awesome swag there!&lt;/p&gt;

&lt;p&gt;Also, see you at #StreamBrew; RSVP &lt;a href="https://bitly.com/m/bytewax"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the next posts I'll cover the following days, so stay tuned!&lt;br&gt;
See you in Austin!&lt;/p&gt;

&lt;p&gt;UPD: &lt;a href="https://dev.to/bytewax/data-council-the-highlights-of-day-2-183e"&gt;Day 2&lt;/a&gt;&lt;/p&gt;

</description>
      <category>conference</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Using Language Models in a Streaming Context to Understand Financial Markets</title>
      <dc:creator>Zander</dc:creator>
      <pubDate>Thu, 16 Mar 2023 08:09:14 +0000</pubDate>
      <link>https://dev.to/bytewax/using-language-models-in-a-streaming-context-to-understand-financial-markets-1o9d</link>
      <guid>https://dev.to/bytewax/using-language-models-in-a-streaming-context-to-understand-financial-markets-1o9d</guid>
      <description>&lt;p&gt;For those who are eager to dive into the code, it's available:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/bytewax" rel="noopener noreferrer"&gt;
        bytewax
      &lt;/a&gt; / &lt;a href="https://github.com/bytewax/news-analyzer" rel="noopener noreferrer"&gt;
        news-analyzer
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Analyze financial news in real-time with Machine Learning
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;news-analyzer&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;Analyze financial news in real-time with Machine Learning&lt;/p&gt;
&lt;p&gt;See &lt;a href="https://github.com/bytewax/news-analyzer/FinancialNewsAnalysis.ipynb" rel="noopener noreferrer"&gt;FinancialNewsAnalysis.ipynb&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/bytewax/news-analyzer" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Effective analysis of news is crucial for understanding the world, especially when it comes to financial markets. Being able to quickly identify significant events, such as a major corporation being hacked and sensitive customer data being compromised, can enable you to respond rapidly and either capitalize on opportunities or minimize losses. In this blog post, we'll delve into how Bytewax and large language models can be leveraged to analyze financial news in real time, providing you with the ability to respond to breaking news more effectively. We need to answer at least three questions to implement our little project successfully:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where do we get the data?&lt;/li&gt;
&lt;li&gt;How do we analyze it?&lt;/li&gt;
&lt;li&gt;How do we access the data source and perform analysis in real-time?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Source
&lt;/h2&gt;

&lt;p&gt;For the data source used in this demo, we will use the &lt;a href="https://alpaca.markets/docs/market-data/news/#real-time-streaming" rel="noopener noreferrer"&gt;Alpaca news API&lt;/a&gt;, which provides websocket access to news articles from Benzinga. To set up an account and create an API key and secret, you can follow the &lt;a href="https://alpaca.markets/docs/market-data/getting-started/" rel="noopener noreferrer"&gt;Alpaca documentation&lt;/a&gt;. &lt;em&gt;You can use any websocket as a data source. A future follow-up will look at how we can build our own real-time news aggregation pipeline for analysis.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Content Analysis
&lt;/h2&gt;

&lt;p&gt;We're obviously going to leverage Large Language Models (LLMs) to analyze news articles, and the best place to look for LLMs is &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;, a company that provides a hub where researchers can release models and datasets, which other researchers and developers can then use via hosted model endpoints and the Transformers library. First, we need to perform sentiment analysis on the headline, which can quickly provide valuable insights. For this, we'll use a fine-tuned BERT model called &lt;a href="https://huggingface.co/ahmedrachid/FinancialBERT-Sentiment-Analysis" rel="noopener noreferrer"&gt;FinancialBERT&lt;/a&gt;. Then we will summarize the content of the article, for which a fine-tuned &lt;a href="https://huggingface.co/facebook/bart-large-cnn" rel="noopener noreferrer"&gt;BART model&lt;/a&gt; will come in handy. Both can be found on &lt;a href="https://huggingface.co" rel="noopener noreferrer"&gt;huggingface.co&lt;/a&gt;. We'll also cover how to use the Transformers library to run the models.&lt;/p&gt;
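&lt;p&gt;As a minimal sketch, both models can be loaded with the Transformers &lt;code&gt;pipeline&lt;/code&gt; API (the model names come from the links above; exact pipeline defaults may differ between Transformers versions):&lt;/p&gt;

```python
# Sketch: loading the sentiment and summarization models with the
# Transformers pipeline API. Model names are from the Hugging Face hub;
# defaults may vary between Transformers versions.
from transformers import pipeline

# FinancialBERT, fine-tuned for financial sentiment classification
sentiment = pipeline(
    "sentiment-analysis",
    model="ahmedrachid/FinancialBERT-Sentiment-Analysis",
)

# BART fine-tuned on CNN/DailyMail for abstractive summarization
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
)

headline = "Tesla Vehicles Could Be Banned From Leaving During A Hurricane In This State"
# Returns a list of dicts like [{"label": ..., "score": ...}]
print(sentiment(headline))
```

&lt;p&gt;We'll wire these pipelines into the dataflow steps later on.&lt;/p&gt;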

&lt;h2&gt;
  
  
  Real-Time Data Processing with Bytewax
&lt;/h2&gt;

&lt;p&gt;If you're not familiar with it, Bytewax is a stateful stream processor that can be used to analyze data in real time, with support for stateful operators like windowing and aggregation. Bytewax is especially suitable for workflows that leverage the Python ecosystem of tools, from data-crunching tools like Pandas to machine learning-focused tools like Hugging Face Transformers. It also supports a variety of data sources, including websockets.&lt;/p&gt;

&lt;p&gt;Let's get started analyzing the news in real time. First things first! Dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;!&lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;bytewax transformers torch sentencepiece websocket-client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Constructing Our Dataflow
&lt;/h2&gt;

&lt;p&gt;A Bytewax dataflow is a sequence of steps that transform data from an input source and then write it to an output. At each step, an operator is used to control the flow of data: whether it should be filtered, aggregated, or accumulated. Developers writing dataflows supply Python code that performs the data transformation at each step.&lt;/p&gt;
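&lt;p&gt;As a plain-Python analogy (not actual Bytewax code), you can picture each step as a transformation applied to items flowing from input to output:&lt;/p&gt;

```python
# Plain-Python analogy of a dataflow: a map step followed by a filter step,
# applied to a stream of article dicts. Bytewax expresses the same shape
# with operators on a Dataflow object, as the sections below show.
def run_pipeline(source):
    # "map" step: pull the headline out of each article
    headlines = (article["headline"] for article in source)
    # "filter" step: drop articles without a headline
    return [h for h in headlines if h]

articles = [
    {"headline": "Tesla Vehicles Could Be Banned From Leaving During A Hurricane"},
    {"headline": ""},
]
print(run_pipeline(articles))
```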

&lt;h3&gt;
  
  
  Input
&lt;/h3&gt;

&lt;p&gt;To begin the dataflow, we'll create an input using the Alpaca websocket, which we'll use to subscribe to articles on multiple tickers. It's important to note that you'll require an Alpaca API key and secret, and it's recommended to store them as environment variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bytewax.dataflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataflow&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bytewax.inputs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManualInputConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;distribute&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_connection&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;API_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ticker_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;input_builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resume_state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resume_state&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;worker_tickers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;distribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker_list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker_count&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribing to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;worker_tickers&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;news_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_tickers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://stream.data.alpaca.markets/v1beta1/news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_SECRET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;news&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;worker_tickers&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# to use without API uncomment the below line and comment the one below that
&lt;/span&gt;        &lt;span class="c1"&gt;# articles = [{"T":"n","id":31248067,"headline":"Tesla Vehicles Could Be Banned From Leaving During A Hurricane In This State","summary":"A lawmaker in one American state could make it hard for owners of electric vehicles to get out of the state in the event of a hurricane. Here’s the potential law and why it’s important.","author":"Chris Katje","created_at":"2023-03-07T22:58:40Z","updated_at":"2023-03-07T22:58:40Z","url":"https://www.benzinga.com/news/23/03/31248067/tesla-vehicles-could-be-banned-from-leaving-during-a-hurricane-in-this-state","content":"\u003cp\u003eA lawmaker in one American state could make it hard for owners of electric vehicles to get out of the state in the event of a hurricane. Here\u0026rsquo;s the potential law and why it\u0026rsquo;s important.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cstrong\u003eWhat Happened:\u003c/strong\u003e States have passed laws aimed at banning the sale of gas-powered vehicles in the future. One state took it a step further by seeking to ban electric vehicle \u003ca href=\"https://www.benzinga.com/news/23/01/30424292/taking-on-elon-musk-this-state-legislature-could-ban-electric-vehicle-sales-by-2035\"\u003esales in the future.\u003c/a\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003eOne of the leading states for electric vehicle purchases could now see a temporary ban on using electric vehicles during the time of a crisis.\u003c/p\u003e\r\n\r\n\u003cp\u003eFlorida Republican state Sen.\u0026nbsp;\u003cstrong\u003eJonathan Martin\u003c/strong\u003e is considering legislation to ban electric vehicles like those from \u003cstrong\u003eTesla Inc\u003c/strong\u003e (NASDAQ:\u003ca class=\"ticker\" href=\"https://www.benzinga.com/stock/TSLA#NASDAQ\"\u003eTSLA\u003c/a\u003e) to be used during hurricane evacuations in the state, according to \u003ca 
href=\"https://electrek.co/2023/03/06/florida-lawmaker-wants-to-ban-evs-from-hurricane-evacuations/\"\u003eElectrek\u003c/a\u003e.\u0026nbsp;\u003c/p\u003e\r\n\r\n\u003cp\u003eMartin told the state\u0026rsquo;s Department of Transportation that electric vehicles could block traffic during evacuations if they run out of battery charge.\u003c/p\u003e\r\n\r\n\u003cp\u003eMartin serves on the Committee on Environment and Natural Resources and the Select Committee on Resiliency.\u003c/p\u003e\r\n\r\n\u003cp\u003eThe Select Committee on Resiliency met with the Florida Department of Transportation executive director of transportation technologies in Florida.\u003c/p\u003e\r\n\r\n\u003cp\u003eAmong the topics discussed were the $198 million the state is going to get from the Bipartisan Infrastructure Law for electric vehicle charging infrastructure from the current administration led by \u003cstrong\u003ePresident Joe Biden.\u003c/strong\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003eThe legislation requires electric vehicle charging stations to be 50 miles apart and serve all electric vehicles.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u0026ldquo;With a couple of guys behind you, you can\u0026rsquo;t get out of the car and push it to the side of the road. Traffic backs up. 
And what might look like a two-hour trip might turn into an eight-hour trip once you\u0026rsquo;re on the road,\u0026rdquo; Martin said.\u003c/p\u003e\r\n\r\n\u003cp\u003eMartin said his concern is with the electric vehicle infrastructure available in the state of Florida.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cem\u003eRelated Link: \u003ca href=\"https://www.benzinga.com/trading-ideas/22/06/27568560/4-stocks-to-watch-this-hurricane-season\"\u003e4 Stocks To Watch This Hurricane Season\u0026nbsp;\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cstrong\u003eWhy It\u0026rsquo;s Important:\u003c/strong\u003e The Florida Department of Transportation told Martin it isn\u0026rsquo;t a fan of banning electric vehicles during hurricane evacuations and that it is looking into portable EV chargers.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u0026ldquo;We have our emergency assistance vehicles that we deploy during a hurricane evacuation that have gas \u0026hellip; we need to provide that same level of service to electrical vehicles,\u0026rdquo; Department of Transportation director of transportation technologies \u003cstrong\u003eTrey Tillander \u003c/strong\u003esaid.\u003c/p\u003e\r\n\r\n\u003cp\u003eThe Tampa Bay Times \u003ca href=\"https://www.tampabay.com/hurricane/2023/02/24/florida-lawmaker-suggests-limiting-electric-vehicles-during-hurricane-evacuations/\"\u003ereported\u003c/a\u003e\u0026nbsp;around 1% of the vehicles in Florida are electric vehicles. 
One of the owners of an EV is state Sen.\u0026nbsp;\u003cstrong\u003eTina Polsky.\u003c/strong\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003e\u0026ldquo;I don\u0026rsquo;t think you can ban an electric vehicle from evacuating because that may be the only car someone has,\u0026rdquo; Polsky said.\u003c/p\u003e\r\n\r\n\u003cp\u003eIn December 2022, there were 203,094 electric vehicles registered in the state of Florida.\u003c/p\u003e\r\n\r\n\u003cp\u003eThe increased funding for charging infrastructure could help ease concerns over charging.\u003c/p\u003e\r\n\r\n\u003cp\u003eUltimately, once people are on the road headed out of the state, they likely won\u0026rsquo;t be able to stop at a charging station, similar to people not being able to quickly stop at a gas station.\u003c/p\u003e\r\n\r\n\u003cp\u003eJust like people prepare for the evacuation by filling up their vehicle with gas, owners of electric vehicles will likely need to fully charge their vehicle before evacuating the state.\u003c/p\u003e\r\n\r\n\u003cp\u003eThe comments from the state senator may have Florida residents thinking about owning at least one non-electric vehicle or a hybrid to ensure they have the best chance to exit the state without future restrictions and without the potential of running out of charge and not finding stations prevalent.\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cem\u003eRead Next:\u0026nbsp;\u003ca href=\"https://www.benzinga.com/analyst-ratings/analyst-color/23/03/31172188/tesla-analysts-praise-vertical-integration-after-investor-day-but-want-more-from-el\"\u003eTesla Analysts Praise Vertical Integration After Investor Day, But Want More From Elon Musk: \u0026#39;Long On Vision, Short On Specifics\u0026#39;\u003c/a\u003e\u003c/em\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cem\u003ePhoto:\u0026nbsp;\u003ca href=\"https://www.shutterstock.com/g/hsaduraphotos\"\u003eHenryk Sadura\u003c/a\u003e\u0026nbsp;via Shutterstock\u003c/em\u003e\u003c/p\u003e\r\n\r\n\u003cp\u003e\u003cbr 
/\u003e\r\n\u0026nbsp;\u003c/p\u003e\r\n","symbols":["TSLA"],"source":"benzinga"}]
&lt;/span&gt;          &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
          &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;news_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;worker_tickers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Dataflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ManualInputConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_builder&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting data returned from the news API looks like the JSON shown here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"T"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"n"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;31248067&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"headline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Tesla Vehicles Could Be Banned From Leaving During A Hurricane In This State"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"A lawmaker in one American state could make it hard for owners of electric vehicles to get out of the state in the event of a hurricane. Here’s the potential law and why it’s important."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Chris Katje"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2023-03-07T22:58:40Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2023-03-07T22:58:40Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"https://www.benzinga.com/news/23/03/31248067/tesla-vehicles-could-be-banned-from-leaving-during-a-hurricane-in-this-state"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003cp&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span 
class="s2"&gt;003eA lawmaker in one American state could make it hard for owners of electric vehicles ... ertical Integration After Investor Day, But Want More From Elon Musk: &lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;0026#39;Long On Vision, Short On Specifics&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;0026#39;&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003c/a&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003c/em&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003c/p&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n\u&lt;/span&gt;&lt;span class="s2"&gt;003cp&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003cem&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003ePhoto:&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;0026nbsp;&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003ca href=&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;https://www.shutterstock.com/g/hsaduraphotos&lt;/span&gt;&lt;span class="se"&gt;\"\u&lt;/span&gt;&lt;span class="s2"&gt;003eHenryk Sadura&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003c/a&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;0026nbsp;via Shutterstock&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003c/em&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span 
class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003c/p&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n\u&lt;/span&gt;&lt;span class="s2"&gt;003cp&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003cbr /&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\r\n\u&lt;/span&gt;&lt;span class="s2"&gt;0026nbsp;&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003c/p&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;003e&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"symbols"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="s2"&gt;"TSLA"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"benzinga"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will use this in the next steps in our dataflow to analyze the sentiment and provide a summary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing Duplicates and Updates
&lt;/h3&gt;

&lt;p&gt;When working with news stories from RSS/Atom feeds or news APIs, it's common to receive duplicates as they're created and then updated. To prevent these duplicates from being analyzed multiple times and incurring additional overhead of running ML models on the same story, we'll use the Bytewax operator &lt;a href="///apidocs/bytewax.dataflow#bytewax.dataflow.Dataflow.stateful_map"&gt;&lt;code&gt;stateful_map&lt;/code&gt;&lt;/a&gt; to create a simplified storage layer. We'll store a list of unique identifiers for each news article we encounter. If an article has been seen before, we'll mark it as an update. Otherwise, we'll add the article's ID to the stateful object. To filter out the updates and avoid reclassifying and summarizing them, we'll use the &lt;a href="///apidocs/bytewax.dataflow#bytewax.dataflow.Dataflow.filter"&gt;&lt;code&gt;filter&lt;/code&gt;&lt;/a&gt; operator. Think of this process as the equivalent of checking a database for a unique ID.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;update&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;update&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stateful_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_articles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;update_articles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;update&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
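To see the dedup logic in action outside of a dataflow, here is a minimal stand-alone run of the same state function (the article IDs and headlines below are invented for illustration):

```python
# Stand-alone check of the dedup logic used in stateful_map above.
def update_articles(articles, news):
    if news['id'] in articles:
        news['update'] = True
    else:
        articles.append(news['id'])
        news['update'] = False
    return articles, news

state = []  # plays the role of the per-key state kept by stateful_map
events = [
    {'id': 101, 'headline': 'TSLA up'},
    {'id': 102, 'headline': 'TSLA down'},
    {'id': 101, 'headline': 'TSLA up (corrected)'},  # an update to 101
]

for news in events:
    state, news = update_articles(state, news)
    print(news['id'], news['update'])
# 101 False
# 102 False
# 101 True
```

The repeated ID 101 comes back flagged with `update = True`, so the downstream `filter(lambda x: not x[1]['update'])` step would drop it before it reaches the models.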



&lt;h3&gt;
  
  
  Sentiment Analysis
&lt;/h3&gt;

&lt;p&gt;Sentiment analysis is the next step in our process. Our approach involves using a fine-tuned Hugging Face model to analyze the article's headline sentiment. We will be leveraging a BERT model for this purpose. BERT, which stands for Bidirectional Encoder Representations from Transformers, was developed by Google. For a detailed understanding of how this model operates and was trained, you can refer to the &lt;a href="https://huggingface.co/ahmedrachid/FinancialBERT-Sentiment-Analysis" rel="noopener noreferrer"&gt;model card&lt;/a&gt; on Hugging Face or the accompanying &lt;a href="https://www.researchgate.net/publication/358284785_FinancialBERT_-_A_Pretrained_Language_Model_for_Financial_Text_Mining" rel="noopener noreferrer"&gt;research paper&lt;/a&gt;. Since we want to analyze each news article independently, the sentiment classification will take place in a &lt;a href="///apidocs/bytewax.dataflow#bytewax.dataflow.Dataflow.map"&gt;&lt;code&gt;map&lt;/code&gt;&lt;/a&gt; operator. Despite the extensive research that goes into designing novel model architectures and creating training datasets, implementing sentiment analysis is remarkably straightforward. &lt;em&gt;Note that if you're following along in a notebook, the model will take some time to download initially.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSeq2SeqLM&lt;/span&gt;

&lt;span class="n"&gt;sent_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ahmedrachid/FinancialBERT-Sentiment-Analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sent_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSequenceClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ahmedrachid/FinancialBERT-Sentiment-Analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sent_nlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentiment-analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sent_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sent_tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sentiment_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker__news&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ticker__news&lt;/span&gt;
    &lt;span class="n"&gt;sentiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sent_nlp&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;headline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
    &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sentiment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentiment_analysis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
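To make the tuple and dictionary shapes flowing through this map step concrete without downloading the model, here is a minimal sketch with the pipeline stubbed out (the label and score values are invented; a real `sent_nlp` call returns the same `[{"label": ..., "score": ...}]` shape):

```python
# Stub of the sentiment pipeline: same output shape, no model download.
def fake_sent_nlp(texts):
    return [{"label": "positive", "score": 0.98} for _ in texts]

def sentiment_analysis(ticker__news, nlp=fake_sent_nlp):
    ticker, news = ticker__news
    sentiment = nlp([news["headline"]])
    news['sentiment'] = sentiment[0]  # attach the first (only) result
    return (ticker, news)

ticker, news = sentiment_analysis(("TSLA", {"headline": "Tesla beats estimates"}))
print(news['sentiment'])  # {'label': 'positive', 'score': 0.98}
```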



&lt;h3&gt;
  
  
  Article Summarization
&lt;/h3&gt;

&lt;p&gt;After analyzing the article's sentiment, we will use a BART (Bidirectional Auto-Regressive Transformers) model, an architecture that combines ideas from Google's BERT and OpenAI's GPT, to summarize its content. Despite the significant effort that goes into creating such a model, implementing it with the Hugging Face Transformers library is relatively easy. We can generate a summarization pipeline and apply it in a &lt;a href="///apidocs/bytewax.dataflow#bytewax.dataflow.Dataflow.map"&gt;&lt;code&gt;map&lt;/code&gt;&lt;/a&gt; step. To obtain better results, we also incorporate an extra step into this map process: cleaning the text before summarizing it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Let's create a summarization pipeline
&lt;/span&gt;&lt;span class="n"&gt;sum_tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facebook/bart-large-cnn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sum_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForSeq2SeqLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facebook/bart-large-cnn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sum_tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sum_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tag_re&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(&amp;lt;!--.*?--&amp;gt;|&amp;lt;[^&amp;gt;]*&amp;gt;)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker__news&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ticker__news&lt;/span&gt;
    &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;article_no_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tag_re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;article_no_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;article_no_tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article_no_tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;130&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bart_summary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bart summary:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;summary_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
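The tag-stripping step can be tried on its own. Here is a minimal sketch using the same regular expression and replacements as the cleaning step above (the sample article text is invented):

```python
import re

# Same pattern as the summarize step: strips HTML comments and tags.
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')

raw = "<p>Tesla <b>beat</b> estimates.</p>\r\n<!-- ad slot -->More below."
no_tags = tag_re.sub('', raw)
cleaned = no_tags.replace("\r", "").replace("\n", "")
print(cleaned)  # Tesla beat estimates.More below.
```

The comment alternative (`&lt;!--.*?--&gt;`) is tried first so entire comments are removed in one match rather than leaving their inner text behind.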



&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;p&gt;With our news analyzed, we can add a capture step to output the modified news object and then run our dataflow. In this instance we will write the output to stdout so we can easily view it, but in a production system we could write the results to a downstream Kafka topic or database for further analysis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you are following along in a notebook, remember that you must be authenticated for this to work and will need to set your Alpaca API key and secret.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bytewax.execution&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_main&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bytewax.outputs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StdOutputConfig&lt;/span&gt;

&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StdOutputConfig&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;run_main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;While our example is simplified, it showcases the power of Bytewax and Hugging Face's language models. We can easily analyze financial news articles in real time, identify significant events, and make informed decisions: using the Alpaca news API as our data source, we constructed a dataflow that deduplicates stories, classifies their sentiment, and summarizes the content of each article.&lt;/p&gt;

&lt;p&gt;The ease of implementation through Python-native Bytewax and the Hugging Face Transformers library makes it accessible for data engineers and researchers to use these state-of-the-art language models in their own projects. We hope this blog post serves as a useful guide for anyone looking to leverage real-time news analysis in their financial decision-making process.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>datastreaming</category>
    </item>
  </channel>
</rss>
