<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oli Makhasoeva</title>
    <description>The latest articles on DEV Community by Oli Makhasoeva (@oli_kitty).</description>
    <link>https://dev.to/oli_kitty</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F684640%2F2dd3eadc-2d40-4ab2-9c20-b8f3f61ac2b1.JPG</url>
      <title>DEV Community: Oli Makhasoeva</title>
      <link>https://dev.to/oli_kitty</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/oli_kitty"/>
    <language>en</language>
    <item>
      <title>M12 invests in the Future of Stream Processing with Bytewax</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Wed, 09 Aug 2023 16:25:50 +0000</pubDate>
      <link>https://dev.to/bytewax/m12-invests-in-the-future-of-stream-processing-with-bytewax-3n43</link>
      <guid>https://dev.to/bytewax/m12-invests-in-the-future-of-stream-processing-with-bytewax-3n43</guid>
      <description>&lt;p&gt;At Bytewax, we're passionate about the power of real-time data. With AI and automation on the rise, accessing data instantly isn't just a cool perk—it's becoming a necessity. Our mission is to build software that will strip away the complexities of streaming and make it accessible for &lt;strong&gt;every developer&lt;/strong&gt; to build real-time data applications.&lt;/p&gt;

&lt;p&gt;We started with the Rust-powered, open source Python stream processor, &lt;a href="https://github.com/bytewax/bytewax"&gt;Bytewax&lt;/a&gt;, which debuted in February 2022 and is now a year and a half old. Since starting the project, we have grown and matured the open source offering to include persistent state, multiple windowing configurations, and new operators for improved performance and scalability. We have also worked to improve the developer experience from integration to deployment with our deployment tool, &lt;a href="https://dev.to/docs/deployment/waxctl"&gt;waxctl&lt;/a&gt;, the ability to rescale without losing data stored in state, and the ability to connect to various input and output sources as well as build your own.&lt;/p&gt;

&lt;p&gt;We are excited to announce a new partner on our journey: M12/GitHub, whose investment in Bytewax will support further development of the open source project as well as &lt;a href="https://dev.to/platform"&gt;&lt;strong&gt;the Bytewax Platform&lt;/strong&gt;&lt;/a&gt;, which will help businesses scale out their Bytewax usage, starting with features like disaster recovery, collaboration and observability tools, and a management layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Bytewax Supports AI and Real-Time Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The world has moved into a new wave of computing in which businesses power their operations and consumer interactions with AI. Sophisticated AI models require a real-time understanding of the world to make accurate decisions. Real-time ML refers to a system that reacts to incoming inputs in real time with a decision powered by an ML model. Stream processing, and more importantly &lt;strong&gt;stream processing with a Python interface&lt;/strong&gt;, is pivotal for real-time ML because it transforms data into features for models.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can read more about real-time ML with Bytewax in &lt;a href="https://dev.to/blog/real-time-ml"&gt;our blog post here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are many other use cases currently being powered by Bytewax from monitoring and reacting to &lt;a href="https://dev.to/blog/online-machine-learning-in-practice-interactive-dashboards-to-detect-data-anomalies-in-real-time"&gt;IoT sensors&lt;/a&gt; for vehicle fleets or across the energy grid, to monitoring &lt;a href="https://dev.to/blog/real-time-stock-prices-with-numpy"&gt;market data&lt;/a&gt; or analyzing &lt;a href="https://dev.to/blog/aws-anomaly-detection"&gt;infrastructure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New Investment: A Vote of Confidence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Microsoft is known for its investments in &lt;a href="https://devblogs.microsoft.com/python/supporting-the-python-community/"&gt;Python&lt;/a&gt; and &lt;a href="https://blogs.microsoft.com/blog/2023/01/23/microsoftandopenaiextendpartnership/"&gt;AI&lt;/a&gt;, creating partnerships with pivotal developers and teams that are moving the industry forward. Their investment in Bytewax is a vote of confidence in the Bytewax vision and mission, and in the importance of stream processing in the next wave of computing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“We believe that Zander and the Bytewax team are building a cutting edge tool that simplifies event and stream processing, and appreciate their thoughtful technical approach leveraging a Python framework to build highly scalable streaming dataflows” said Priyanka Mitra, Partner at M12 and co-founder of the M12 GitHub Fund. “We are impressed with their engagement of the open-source community and are committed to supporting Bytewax in accomplishing their mission, especially as they explore cutting edge AI and ML use cases” she added.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Future Bytewax&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Microsoft investment will help Bytewax establish a thriving community around the open source project and build out features for the paid platform to support adoption of the technology. We have been working to solve exceptionally hard problems like rescaling dataflows and cloud backup for disaster recovery as well as improving performance. We are excited to continue to bring features like these to Bytewax with a simple user interface and low complexity to support users across all stages of their journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect with us&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We would love to hear from our users and any Python and streaming enthusiasts about how we can increase our support for workloads and Python development patterns. Please feel free to reach out via our &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-vkos2f6r-_SeT9pF2~n9ArOaeI3ND2w"&gt;Slack community&lt;/a&gt; or the &lt;a href="https://github.com/bytewax/bytewax"&gt;GitHub repo&lt;/a&gt;. We would also like to take this opportunity to thank our users, investors, and community for their continued support! If you like what we are building, please &lt;a href="https://github.com/bytewax/bytewax"&gt;⭐ the repo&lt;/a&gt; 😀.&lt;/p&gt;

</description>
      <category>investment</category>
      <category>streaming</category>
    </item>
    <item>
      <title>Bytewax v0.16.2 is out!</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Thu, 08 Jun 2023 20:35:31 +0000</pubDate>
      <link>https://dev.to/bytewax/bytewax-v0162-is-out-1dbb</link>
      <guid>https://dev.to/bytewax/bytewax-v0162-is-out-1dbb</guid>
      <description>&lt;p&gt;🎉 Exciting News from Bytewax! 🎉&lt;/p&gt;

&lt;p&gt;We're thrilled to announce the release of Bytewax v0.16.2!&lt;/p&gt;

&lt;p&gt;Firstly, support for Windows builds is here! 🖥️&lt;/p&gt;

&lt;p&gt;This is a significant step forward, not only because it makes Bytewax more accessible to developers across different platforms, but also because we're particularly excited to welcome the first contribution from a member of our community, Jim Zhang (&lt;a href="https://github.com/bytewax/bytewax/pull/249"&gt;@zzl221000&lt;/a&gt;)!&lt;/p&gt;

&lt;p&gt;A big shout-out to Jim!!!&lt;/p&gt;

&lt;p&gt;In addition to Windows support, v0.16.2 also introduces a &lt;code&gt;CSVInput&lt;/code&gt; subclass of &lt;code&gt;FileInput&lt;/code&gt;, further expanding the versatility of Bytewax.&lt;/p&gt;
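
&lt;p&gt;&lt;em&gt;Conceptually, a CSV input source reads a file line by line and emits each row as a dict keyed by the header, much like Python's &lt;code&gt;csv.DictReader&lt;/code&gt;. Here is a minimal stdlib-only sketch of that behavior (illustrative only; see the Bytewax docs for the real &lt;code&gt;CSVInput&lt;/code&gt; API):&lt;/em&gt;&lt;/p&gt;

```python
import csv
import io

# A conceptual stand-in for what a CSV input source does:
# parse each line against the header row and yield one dict
# per record. (Illustrative sketch, not the Bytewax API.)
def read_csv_rows(text):
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield dict(row)

sample = "sensor_id,temperature\na,21.5\nb,19.0\n"
rows = list(read_csv_rows(sample))
print(rows[0])  # {'sensor_id': 'a', 'temperature': '21.5'}
```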

&lt;p&gt;Here's a quick rundown of what's changed in this release:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/244"&gt;PyO3 has been updated by @whoahbot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/245"&gt;Added a _CSVSource and CSVInput subclass of FileInput by @awmatheson&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/247"&gt;Fixed an encoder issue by @Psykopear&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bytewax/bytewax/pull/249"&gt;Windows build support by @zzl221000&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're OSS and incredibly grateful for the community's contributions. Share what you're building with Bytewax in our &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;Slack community&lt;/a&gt;, and happy coding! 🚀 Check out the changes on &lt;a href="https://github.com/bytewax/bytewax/"&gt;our GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Bytewax at Data Science Summit. Interactive Dashboards To Detect Data Anomalies In Real Time</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Wed, 24 May 2023 21:41:03 +0000</pubDate>
      <link>https://dev.to/bytewax/bytewax-at-data-science-summit-interactive-dashboards-to-detect-data-anomalies-in-real-time-5e3c</link>
      <guid>https://dev.to/bytewax/bytewax-at-data-science-summit-interactive-dashboards-to-detect-data-anomalies-in-real-time-5e3c</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NOnrBjg1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdl0zu6hnx1png37c1gt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NOnrBjg1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdl0zu6hnx1png37c1gt.png" alt="talk invite" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Science Summit
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dssconf.pl/en/"&gt;Data Science Summit&lt;/a&gt; is the largest and oldest independent data science conference in the CEE region. This year, we are joining them online and our CEO, Zander Matheson, is presenting! For the sixth time Data Science Summit shares knowledge in topics ranging from analysis and processing (including big data), implementation issues to visualisation (BI) and management topics. This year's edition of the most important Data Science event in Poland dedicated to Machine Learning!&lt;/p&gt;

&lt;p&gt;10 tracks, 100+ talks, the agenda is packed with cutting-edge insights 💡&lt;/p&gt;

&lt;p&gt;🎟️ Use code DSSML23RP20 until 09.06.2023 to grab a Standard or PRO ticket at a 20% discount&lt;/p&gt;

&lt;p&gt;Here are details of the talk Zander is presenting:&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive dashboards to detect data anomalies in real time
&lt;/h2&gt;

&lt;p&gt;Join Zander for a technical exploration of crafting interactive dashboards that employ online machine learning algorithms for real-time anomaly detection across hundreds of sensors. He will guide you through setting up a development environment with a streaming system (Kafka or similar), loading sensor data into the streaming system with Bytewax, and writing a dataflow using River that transforms the data and applies different anomaly detection algorithms to determine whether there are anomalies in the sensor data. The icing on the cake? Visualizing all these complex processes on a dynamic, real-time dashboard using Rerun! Equip yourself with the tools and knowledge to monitor and react to data anomalies as they happen. Come experience the power of Python for real-time data anomaly detection and interactive visualization!&lt;/p&gt;
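
&lt;p&gt;&lt;em&gt;As a taste of the kind of logic the talk covers, here is a minimal, stdlib-only sketch of streaming anomaly detection using a running z-score. The talk itself uses River's online ML algorithms; this is only an illustration:&lt;/em&gt;&lt;/p&gt;

```python
import math

# Minimal sketch of streaming anomaly detection: keep running
# statistics per sensor (Welford's online algorithm) and flag
# readings that deviate strongly from the running mean.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomaly(self, x, threshold=3.0):
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0:
                return abs(x - self.mean) > threshold * std
        return False

stats = RunningStats()
flags = []
for reading in [20.0, 20.5, 19.8, 20.2, 20.1, 55.0]:
    flags.append(stats.is_anomaly(reading))
    stats.update(reading)
print(flags)  # [False, False, False, False, False, True]
```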

&lt;p&gt;If this abstract sounds interesting, you might want to check out these blog posts: &lt;a href="https://bytewax.io/blog/data-visualization-with-rerun"&gt;Real-Time Anomaly Detection Visualization with Bytewax and Rerun&lt;/a&gt; and &lt;a href="https://bytewax.io/blog/online-machine-learning-iot#online-machine-learning-in-python"&gt;Online Machine Learning for IoT&lt;/a&gt;. The talk goes beyond these posts but covers the same domains.&lt;/p&gt;

&lt;p&gt;We are looking forward to exchanging knowledge, sharing our ideas, and learning from the experiences of other attendees and speakers. Stay tuned for updates from &lt;a href="https://ml.dssconf.pl/en/"&gt;the conference&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>conference</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Easy yet flexible way to display child routes in tabs with Vue 3</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Tue, 09 May 2023 18:22:17 +0000</pubDate>
      <link>https://dev.to/bytewax/easy-yet-flexible-way-to-display-child-routes-in-tabs-with-vue-3-2ng8</link>
      <guid>https://dev.to/bytewax/easy-yet-flexible-way-to-display-child-routes-in-tabs-with-vue-3-2ng8</guid>
      <description>&lt;p&gt;Hello, I'm Konrad Sieńkowski and I am a front-end developer &amp;amp; UI designer here at Bytewax. I want to share with you something that I worked on recently. In this article, I'll walk through the steps to set up a new Vue application, configure the router for nested routes, create the AppTabs.vue component, and customize your tabs using route meta fields for labels and icons. By the end, you'll know how to make an easy yet flexible solution for displaying child routes in tabs. So, let's dive in!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For those eager to dive in, check out the &lt;a href="https://github.com/konradsienkowski/vue-3-child-route-tabs/"&gt;project repository&lt;/a&gt; on GitHub.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;First of all, we're going to create a fresh new application using &lt;code&gt;&amp;amp;gt; npm init vue@latest&lt;/code&gt;. The &lt;code&gt;create-vue&lt;/code&gt; scaffolding tool will ask you about including optional features in the project. The only one required for this tutorial is &lt;strong&gt;Vue Router&lt;/strong&gt;. I chose TypeScript &amp;amp; Prettier as well, but that's up to your personal preference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing routes &amp;amp; structure
&lt;/h2&gt;

&lt;p&gt;Once you follow the instructions on installing dependencies and running the app, you can start customizing the application. My first step was to simplify &lt;code&gt;App.vue&lt;/code&gt; a bit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;template&amp;amp;gt;
  &amp;amp;lt;nav&amp;amp;gt;
    &amp;amp;lt;RouterLink to=&amp;amp;quot;/&amp;amp;quot;&amp;amp;gt;Home&amp;amp;lt;/RouterLink&amp;amp;gt;
    &amp;amp;lt;RouterLink to=&amp;amp;quot;/tabs&amp;amp;quot;&amp;amp;gt;Tabs demo&amp;amp;lt;/RouterLink&amp;amp;gt;
  &amp;amp;lt;/nav&amp;amp;gt;

  &amp;amp;lt;RouterView /&amp;amp;gt;
&amp;amp;lt;/template&amp;amp;gt;

&amp;amp;lt;script setup lang=&amp;amp;quot;ts&amp;amp;quot;&amp;amp;gt;
import { RouterLink, RouterView } from &amp;amp;apos;vue-router&amp;amp;apos;
&amp;amp;lt;/script&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we're focusing on nested/child routes in this article, there's no need to spend much time on the homepage. I've also renamed the default &lt;code&gt;AboutView.vue&lt;/code&gt; to &lt;code&gt;TabsView.vue&lt;/code&gt; and created a bunch of example views in &lt;code&gt;views/tabs&lt;/code&gt;, called &lt;code&gt;TabsAbout.vue&lt;/code&gt;, &lt;code&gt;TabsBlog.vue&lt;/code&gt;, &lt;code&gt;TabsContact.vue&lt;/code&gt;, and &lt;code&gt;TabsRelated.vue&lt;/code&gt;. We're going to include them in our route structure in the next step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- views
-- tabs
--- TabsAbout.vue
--- TabsBlog.vue
--- TabsContact.vue
--- TabsRelated.vue
-- HomeView.vue
-- TabsView.vue

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have a simple structure for our views/pages, it's time to include them in the router configuration. Let's open &lt;code&gt;router/index.ts&lt;/code&gt; and adjust it to our needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createRouter, createWebHistory } from &amp;amp;apos;vue-router&amp;amp;apos;
import HomeView from &amp;amp;apos;../views/HomeView.vue&amp;amp;apos;

const router = createRouter({
  history: createWebHistory(import.meta.env.BASE_URL),
  routes: [
    {
      path: &amp;amp;apos;/&amp;amp;apos;,
      name: &amp;amp;apos;home&amp;amp;apos;,
      component: HomeView
    },
    {
      path: &amp;amp;apos;/tabs&amp;amp;apos;,
      name: &amp;amp;apos;tabs&amp;amp;apos;,
      component: () =&amp;amp;gt; import(&amp;amp;apos;../views/TabsView.vue&amp;amp;apos;),
      children: [
        {
          name: &amp;amp;apos;about&amp;amp;apos;,
          path: &amp;amp;apos;&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsAbout.vue&amp;amp;apos;),
        },
        {
          name: &amp;amp;apos;blog&amp;amp;apos;,
          path: &amp;amp;apos;blog&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsBlog.vue&amp;amp;apos;),
        },
        {
          name: &amp;amp;apos;contact&amp;amp;apos;,
          path: &amp;amp;apos;contact&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsContact.vue&amp;amp;apos;),
        },
        {
          name: &amp;amp;apos;related&amp;amp;apos;,
          path: &amp;amp;apos;related&amp;amp;apos;,
          component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsRelated.vue&amp;amp;apos;),
        },
      ]
    }
  ]
})

export default router

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now our application has nested/child routes, which we can use to display tabs in the component.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tabs component
&lt;/h2&gt;

&lt;p&gt;In this step, we're going to create our tab component, include it in the first-level route view, and then extend it with additional features. First, we'll create a file called &lt;code&gt;AppTabs.vue&lt;/code&gt; in the &lt;code&gt;components&lt;/code&gt; directory. Since our component is meant to be flexible and might be used in different routes, we're following the &lt;a href="https://v2.vuejs.org/v2/style-guide/?redirect=true#Base-component-names-strongly-recommended"&gt;Vue naming convention&lt;/a&gt; for base components.&lt;/p&gt;

&lt;p&gt;Let's start with the &lt;code&gt;&amp;amp;lt;script setup&amp;amp;gt;&lt;/code&gt; section. We use the &lt;code&gt;useRouter()&lt;/code&gt; composable there to access the router instance, and then use it to define the &lt;code&gt;tabs&lt;/code&gt; computed property.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;script setup lang=&amp;amp;quot;ts&amp;amp;quot;&amp;amp;gt;
import { computed, type ComputedRef } from &amp;amp;apos;vue&amp;amp;apos;
import { useRouter, RouterView, type RouteRecordRaw } from &amp;amp;apos;vue-router&amp;amp;apos;

// Use children routes for the tabs
const router = useRouter()
const tabs: ComputedRef&amp;amp;lt;RouteRecordRaw[] | undefined&amp;amp;gt; = computed(() =&amp;amp;gt; {
  const currentRoute = router.currentRoute.value.name
  return router.options.routes?.find(
    (route) =&amp;amp;gt;
      route.name === currentRoute || route.children?.find((child) =&amp;amp;gt; child.name === currentRoute)
  )?.children
})
&amp;amp;lt;/script&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After getting the current route name from the &lt;code&gt;router.currentRoute&lt;/code&gt; property, we use it to find the matching route within the routes array (either among top-level routes or their children) and return its child routes. Now it's time to include them in the component template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;template&amp;amp;gt;
  &amp;amp;lt;div class=&amp;amp;quot;tabs&amp;amp;quot; v-if=&amp;amp;quot;tabs&amp;amp;quot;&amp;amp;gt;
    &amp;amp;lt;nav class=&amp;amp;quot;tabs__nav&amp;amp;quot;&amp;amp;gt;
      &amp;amp;lt;RouterLink
        v-for=&amp;amp;quot;tab in tabs&amp;amp;quot;
        :key=&amp;amp;quot;tab.name&amp;amp;quot;
        class=&amp;amp;quot;tabs__nav-item&amp;amp;quot;
        :to=&amp;amp;quot;{ name: tab.name }&amp;amp;quot;
      &amp;amp;gt;
        {{ tab.name }}
      &amp;amp;lt;/RouterLink&amp;amp;gt;
    &amp;amp;lt;/nav&amp;amp;gt;
    &amp;amp;lt;div class=&amp;amp;quot;tabs__wrapper&amp;amp;quot;&amp;amp;gt;
      &amp;amp;lt;RouterView v-slot=&amp;amp;quot;{ Component }&amp;amp;quot;&amp;amp;gt;
        &amp;amp;lt;Transition name=&amp;amp;quot;fade&amp;amp;quot; mode=&amp;amp;quot;out-in&amp;amp;quot;&amp;amp;gt;
          &amp;amp;lt;component :is=&amp;amp;quot;Component&amp;amp;quot;&amp;amp;gt;&amp;amp;lt;/component&amp;amp;gt;
        &amp;amp;lt;/Transition&amp;amp;gt;
      &amp;amp;lt;/RouterView&amp;amp;gt;
    &amp;amp;lt;/div&amp;amp;gt;
  &amp;amp;lt;/div&amp;amp;gt;
&amp;amp;lt;/template&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the &lt;code&gt;&amp;amp;lt;div&amp;amp;gt;&lt;/code&gt; wrapper, we have two parts of our component:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the navigation/tabs, where we iterate over the output of the &lt;code&gt;tabs&lt;/code&gt; computed getter and display links to the child routes,&lt;/li&gt;
&lt;li&gt;the tabs wrapper, where we use the native &lt;code&gt;&amp;amp;lt;RouterView&amp;amp;gt;&lt;/code&gt; and its v-slot API to &lt;a href="https://router.vuejs.org/guide/advanced/transitions.html#transitions"&gt;wrap the nested route's content in a &lt;code&gt;&amp;amp;lt;Transition&amp;amp;gt;&lt;/code&gt; component&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we can include our component in the &lt;code&gt;TabsView.vue&lt;/code&gt; code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;template&amp;amp;gt;
  &amp;amp;lt;div class=&amp;amp;quot;view&amp;amp;quot;&amp;amp;gt;
    &amp;amp;lt;AppTabs /&amp;amp;gt;
  &amp;amp;lt;/div&amp;amp;gt;
&amp;amp;lt;/template&amp;amp;gt;

&amp;amp;lt;script setup lang=&amp;amp;quot;ts&amp;amp;quot;&amp;amp;gt;
import AppTabs from &amp;amp;apos;@/components/AppTabs.vue&amp;amp;apos;
&amp;amp;lt;/script&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And take a look at the result: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cSpP8Rkh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2f6e016637.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cSpP8Rkh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2f6e016637.gif" alt="Vite-App.gif" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending &amp;amp; styling up the tabs
&lt;/h2&gt;

&lt;p&gt;Our tabs work nicely, and we can easily include them in any view that has child routes. However, the tab navigation uses &lt;code&gt;route.name&lt;/code&gt; as the link label, and &lt;a href="https://router.vuejs.org/guide/essentials/named-routes.html"&gt;route names&lt;/a&gt; should rather remain simple and easy to use. We can extend our solution with route meta fields to include a custom tab label &amp;amp; icon for each child route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use custom route meta fields
&lt;/h3&gt;

&lt;p&gt;Before extending our component's code, let's add a &lt;a href="https://router.vuejs.org/guide/advanced/meta.html"&gt;meta field&lt;/a&gt; to each nested route in &lt;code&gt;router/index.ts&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;children: [
  {
    name: &amp;amp;apos;about&amp;amp;apos;,
    path: &amp;amp;apos;&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsAbout.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;About&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;blog&amp;amp;apos;,
    path: &amp;amp;apos;blog&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsBlog.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Blog&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;contact&amp;amp;apos;,
    path: &amp;amp;apos;contact&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsContact.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Contact&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;related&amp;amp;apos;,
    path: &amp;amp;apos;related&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsRelated.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Related&amp;amp;apos; }
  },
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can use &lt;code&gt;tabLabel&lt;/code&gt; value in our &lt;code&gt;AppTabs.vue&lt;/code&gt; component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;RouterLink
  v-for=&amp;amp;quot;tab in tabs&amp;amp;quot;
  :key=&amp;amp;quot;tab.name&amp;amp;quot;
  class=&amp;amp;quot;tabs__nav-item&amp;amp;quot;
  :to=&amp;amp;quot;{ name: tab.name }&amp;amp;quot;
&amp;amp;gt;
  &amp;amp;lt;span class=&amp;amp;quot;tabs__nav-label&amp;amp;quot; v-if=&amp;amp;quot;tab.meta?.tabLabel&amp;amp;quot;&amp;amp;gt;{{ tab.meta.tabLabel }}&amp;amp;lt;/span&amp;amp;gt;
&amp;amp;lt;/RouterLink&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add material icons to tabs navigation
&lt;/h3&gt;

&lt;p&gt;Our tab navigation is going to look better with icons. Let's install Google's Material Symbols library via npm with &lt;code&gt;npm install material-symbols@latest&lt;/code&gt; and include it in &lt;code&gt;main.ts&lt;/code&gt; (&lt;code&gt;main.js&lt;/code&gt; if you're not using TypeScript):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createApp } from &amp;amp;apos;vue&amp;amp;apos;
import App from &amp;amp;apos;./App.vue&amp;amp;apos;
import router from &amp;amp;apos;./router&amp;amp;apos;

import &amp;amp;apos;material-symbols/outlined.css&amp;amp;apos;;
import &amp;amp;apos;./assets/main.css&amp;amp;apos;

const app = createApp(App)

app.use(router)

app.mount(&amp;amp;apos;#app&amp;amp;apos;)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we can add &lt;code&gt;tabIcon&lt;/code&gt; properties to the route meta fields, filling them with icon names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;children: [
  {
    name: &amp;amp;apos;about&amp;amp;apos;,
    path: &amp;amp;apos;&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsAbout.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;About&amp;amp;apos;, tabIcon: &amp;amp;apos;group&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;blog&amp;amp;apos;,
    path: &amp;amp;apos;blog&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsBlog.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Blog&amp;amp;apos;, tabIcon: &amp;amp;apos;feed&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;contact&amp;amp;apos;,
    path: &amp;amp;apos;contact&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsContact.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Contact&amp;amp;apos;, tabIcon: &amp;amp;apos;email&amp;amp;apos; }
  },
  {
    name: &amp;amp;apos;related&amp;amp;apos;,
    path: &amp;amp;apos;related&amp;amp;apos;,
    component: () =&amp;amp;gt; import(&amp;amp;apos;../views/tabs/TabsRelated.vue&amp;amp;apos;),
    meta: { tabLabel: &amp;amp;apos;Related&amp;amp;apos;, tabIcon: &amp;amp;apos;star&amp;amp;apos; }
  },
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we're ready to include them in the component:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;RouterLink
  v-for=&amp;amp;quot;tab in tabs&amp;amp;quot;
  :key=&amp;amp;quot;tab.name&amp;amp;quot;
  class=&amp;amp;quot;tabs__nav-item&amp;amp;quot;
  :to=&amp;amp;quot;{ name: tab.name }&amp;amp;quot;
&amp;amp;gt;
  &amp;amp;lt;span class=&amp;amp;quot;tabs__nav-icon material-symbols-outlined&amp;amp;quot; v-if=&amp;amp;quot;tab.meta?.tabIcon&amp;amp;quot;&amp;amp;gt;{{
    tab.meta.tabIcon
  }}&amp;amp;lt;/span&amp;amp;gt;
  &amp;amp;lt;span class=&amp;amp;quot;tabs__nav-label&amp;amp;quot; v-if=&amp;amp;quot;tab.meta?.tabLabel&amp;amp;quot;&amp;amp;gt;{{ tab.meta.tabLabel }}&amp;amp;lt;/span&amp;amp;gt;
&amp;amp;lt;/RouterLink&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done! We have custom icons &amp;amp; labels based on route meta fields displayed in our tabs component. Now it's time to add the final styling touches with CSS. &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1TDttIo9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/icons_e127e03029.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1TDttIo9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://images.production.bytewax.io/icons_e127e03029.png" alt="icons.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Styling up the component
&lt;/h2&gt;

&lt;p&gt;You can style the component on your own, customizing it fully to your needs, or use the code below in &lt;code&gt;AppTabs.vue&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;amp;lt;style&amp;amp;gt;
.tabs {
  border: 1px solid rgba(0, 0, 0, 0.2);
  border-radius: 0.5rem;
}
.tabs__wrapper {
  padding: 1.5rem 2rem 2rem 2rem;
}
.tabs__nav {
  display: flex;
  flex-direction: row;
  border-bottom: 1px solid rgba(0, 0, 0, 0.2);
}
.tabs__nav-item {
  display: flex;
  flex-direction: row;
  align-items: center;
  flex-wrap: nowrap;
  text-decoration: none;
  padding: 1rem;
  border-bottom: 3px solid transparent;
  margin-bottom: -1px;
  color: rgba(0, 0, 0, 0.87);
  transition: border-color 0.25s ease-in-out;
}
.tabs__nav-icon {
  margin-right: 0.5rem;
  color: rgba(0, 0, 0, 0.38);
}
.tabs__nav-item:hover {
  border-color: #ccc;
}
.tabs__nav-item.router-link-exact-active {
  border-color: var(--green);
  font-weight: 600;
}
&amp;amp;lt;/style&amp;amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Following the &lt;a href="https://getbem.com/naming/"&gt;BEM naming convention&lt;/a&gt; is easier with SCSS, but I didn't want to fill the example with extra dependencies.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our tab component looks pretty slick now: &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3LfxBnHl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2_739c51d2d1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3LfxBnHl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://images.production.bytewax.io/Vite_App_2_739c51d2d1.gif" alt="Vite-App-2.gif" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Instead of conclusion
&lt;/h2&gt;

&lt;p&gt;Now, I encourage you to give it a try, explore further customizations, and share your experiences and improvements with &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;our community&lt;/a&gt;. Let's continue building more efficient and elegant applications together!&lt;/p&gt;

</description>
      <category>vue</category>
      <category>ui</category>
    </item>
    <item>
      <title>Lessons we learned while building a stateful Kafka connector and tips for creating yours</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Wed, 03 May 2023 20:16:54 +0000</pubDate>
      <link>https://dev.to/bytewax/lessons-we-learned-while-building-a-stateful-kafka-connector-and-tips-for-creating-yours-157b</link>
      <guid>https://dev.to/bytewax/lessons-we-learned-while-building-a-stateful-kafka-connector-and-tips-for-creating-yours-157b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3zQ82qXy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjksww988bqqyata57p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3zQ82qXy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjksww988bqqyata57p8.png" alt="Bytewax" width="798" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Bytewax framework is a flexible tool designed to meet the challenges faced by Python developers in today's data-driven world. It aims to provide seamless integrations and time-saving shortcuts for data engineers dealing with streaming data, making their work more efficient and effective. One important part of developing Bytewax is input connectors: they establish the connection between external systems and Bytewax so that users can import data from those systems.&lt;/p&gt;

&lt;p&gt;Here we're going to show how to write a custom input connector by walking through how we wrote &lt;a href="https://github.com/bytewax/bytewax/blob/5d5ec04851c2e254cf1aaf429f4890be3a3ce070/pysrc/bytewax/connectors/kafka.py"&gt;our built-in Kafka input connector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Writing input connectors for arbitrary systems while supporting failure recovery and strong delivery guarantees requires a solid understanding of how recovery works internal to Bytewax and the chosen output system. We strongly encourage you to use the connectors we have built into &lt;a href="https://bytewax.io/apidocs/bytewax.connectors/index"&gt;&lt;code&gt;bytewax.connectors&lt;/code&gt;&lt;/a&gt; if possible, and read the documentation on their limits.&lt;/p&gt;

&lt;p&gt;If you are interested in writing your own, this article can give you an introduction into some of the decisions involved in writing an input connector for an ordered, partitioned input stream.&lt;/p&gt;

&lt;p&gt;If you need any help at all writing a connector, come say "hi" and ask questions in &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;the Bytewax community Slack&lt;/a&gt;! We are happy to help!&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitions
&lt;/h2&gt;

&lt;p&gt;Subclassing &lt;a href="https://bytewax.io/apidocs/bytewax.inputs#bytewax.inputs.PartitionedInput"&gt;&lt;code&gt;bytewax.inputs.PartitionedInput&lt;/code&gt;&lt;/a&gt; is the core API for writing an input connector when your input has a fixed number of &lt;strong&gt;partitions&lt;/strong&gt;. A partition is a "sub-stream" of data that can be read concurrently and independently.&lt;/p&gt;

&lt;p&gt;To write a &lt;code&gt;PartitionedInput&lt;/code&gt; subclass, you need to answer three questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many partitions are there?&lt;/li&gt;
&lt;li&gt;How can I build a source that reads a single partition?&lt;/li&gt;
&lt;li&gt;How can I rewind a partition and read from a specific item?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are answered via the abstract methods &lt;code&gt;list_parts&lt;/code&gt; and &lt;code&gt;build_part&lt;/code&gt;, and the &lt;code&gt;resume_state&lt;/code&gt; argument, respectively.&lt;/p&gt;

&lt;p&gt;We're going to use the &lt;a href="https://github.com/confluentinc/confluent-kafka-python"&gt;&lt;code&gt;confluent-kafka&lt;/code&gt;&lt;/a&gt; package to actually communicate with the Kafka cluster. Let's import all the things we'll need for this input source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Dict, Iterable

from confluent_kafka import (
    Consumer,
    KafkaError,
    OFFSET_BEGINNING,
    TopicPartition,
)
from confluent_kafka.admin import AdminClient

from bytewax.inputs import PartitionedInput, StatefulSource

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our KafkaInput connector is going to read from a specific set of topics on a cluster. First, let's define our class and write a constructor that takes all the arguments that make sense for configuring this specific kind of input source. This is going to be the public entry point to this connector, and is what you'll pass to the &lt;a href="https://bytewax.io/apidocs/bytewax.dataflow#bytewax.dataflow.Dataflow.input"&gt;&lt;code&gt;bytewax.dataflow.Dataflow.input&lt;/code&gt;&lt;/a&gt; operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class KafkaInput(PartitionedInput):
    def __init__(
        self,
        brokers: Iterable[str],
        topics: Iterable[str],
        tail: bool = True,
        starting_offset: int = OFFSET_BEGINNING,
        add_config: Dict[str, str] = None,
    ):
        add_config = add_config or {}

        if isinstance(brokers, str):
            raise TypeError(&amp;amp;quot;brokers must be an iterable and not a string&amp;amp;quot;)
        # Store the configuration for use in list_parts and build_part.
        self._brokers = brokers
        self._topics = topics
        self._tail = tail
        self._starting_offset = starting_offset
        self._add_config = add_config

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Listing Partitions
&lt;/h3&gt;

&lt;p&gt;Next, let's answer question one: how many partitions are there? Conveniently, &lt;code&gt;confluent-kafka&lt;/code&gt; provides &lt;a href="https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.admin.AdminClient.list_topics"&gt;&lt;code&gt;AdminClient.list_topics&lt;/code&gt;&lt;/a&gt;, which gives you the partition count of each topic, packed deep in a metadata object. The signature of &lt;code&gt;PartitionedInput.list_parts&lt;/code&gt; says it must return a set of strings with the IDs of all the partitions. Let's build the &lt;code&gt;AdminClient&lt;/code&gt; from the configuration stored on the instance and then delegate to a &lt;code&gt;_list_parts&lt;/code&gt; function so we can re-use it if necessary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class KafkaInput(PartitionedInput):
    def list_parts(self):
        config = {
            &amp;amp;quot;bootstrap.servers&amp;amp;quot;: &amp;amp;quot;,&amp;amp;quot;.join(self._brokers),
        }
        config.update(self._add_config)
        client = AdminClient(config)

        return set(_list_parts(client, self._topics))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function unpacks the nested metadata returned from &lt;code&gt;AdminClient.list_topics&lt;/code&gt; and yields strings like "3-my_topic" for partition index 3 of the topic &lt;code&gt;my_topic&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _list_parts(client, topics):
    for topic in topics:
        # List topics one-by-one so if auto-create is turned on,
        # we respect that.
        cluster_metadata = client.list_topics(topic)
        topic_metadata = cluster_metadata.topics[topic]
        if topic_metadata.error is not None:
            raise RuntimeError(
                f&amp;amp;quot;error listing partitions for Kafka topic `{topic!r}`: &amp;amp;quot;
                f&amp;amp;quot;{topic_metadata.error.str()}&amp;amp;quot;
            )
        part_idxs = topic_metadata.partitions.keys()
        for i in part_idxs:
            yield f&amp;amp;quot;{i}-{topic}&amp;amp;quot;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How do you decide what the partition ID string should be? It should be something that globally identifies this partition, hence combining partition number and topic name.&lt;/p&gt;
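To make that scheme concrete, here is a minimal, hypothetical sketch in plain Python (independent of Bytewax) of building and parsing such IDs. Putting the index first and parsing with `maxsplit=1`, as `build_part` does, matters because topic names may themselves contain dashes:

```python
def part_id(part_idx, topic):
    # Globally unique partition ID: partition index first, then topic name.
    return f"{part_idx}-{topic}"


def parse_part_id(part_id_str):
    # Invert part_id. maxsplit=1 keeps dashes inside the topic name intact.
    idx, topic = part_id_str.split("-", 1)
    return int(idx), topic


# Round trip, including a topic name that itself contains a dash:
assert parse_part_id(part_id(3, "my-topic")) == (3, "my-topic")
```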

&lt;p&gt;&lt;code&gt;PartitionedInput.list_parts&lt;/code&gt; might be called multiple times from multiple workers as a Bytewax cluster is set up and resumed, so it must return exactly the same set of partitions on every call in order to work correctly. Changing the number of partitions is not currently supported with recovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building Partitions
&lt;/h3&gt;

&lt;p&gt;Next, let's answer question two: how can I build a source that reads a single partition? We can use &lt;code&gt;confluent-kafka&lt;/code&gt;'s &lt;a href="https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#consumer"&gt;&lt;code&gt;Consumer&lt;/code&gt;&lt;/a&gt; to make a Kafka consumer that will read a specific topic and partition starting from an offset. The signature of &lt;code&gt;PartitionedInput.build_part&lt;/code&gt; takes a specific partition ID (we'll ignore the resume state for now) and must return a stateful source.&lt;/p&gt;

&lt;p&gt;We parse the partition ID to determine which Kafka partition we should be consuming from. (Hence the importance of having a globally unique partition ID.) Then we build a &lt;code&gt;Consumer&lt;/code&gt; that connects to the Kafka cluster, and build our custom &lt;code&gt;_KafkaSource&lt;/code&gt; stateful source. That is where the actual reading of input items happens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class KafkaInput(PartitionedInput):
    def build_part(self, for_part, resume_state):
        part_idx, topic = for_part.split(&amp;amp;quot;-&amp;amp;quot;, 1)
        part_idx = int(part_idx)
        assert topic in self._topics, &amp;amp;quot;Can&amp;amp;apos;t resume from different set of Kafka topics&amp;amp;quot;

        config = {
            # We&amp;amp;apos;ll manage our own &amp;amp;quot;consumer group&amp;amp;quot; via recovery
            # system.
            &amp;amp;quot;group.id&amp;amp;quot;: &amp;amp;quot;BYTEWAX_IGNORED&amp;amp;quot;,
            &amp;amp;quot;enable.auto.commit&amp;amp;quot;: &amp;amp;quot;false&amp;amp;quot;,
            &amp;amp;quot;bootstrap.servers&amp;amp;quot;: &amp;amp;quot;,&amp;amp;quot;.join(self._brokers),
            &amp;amp;quot;enable.partition.eof&amp;amp;quot;: str(not self._tail),
        }
        config.update(self._add_config)
        consumer = Consumer(config)
        return _KafkaSource(
            consumer, topic, part_idx, self._starting_offset, resume_state
        )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stateful Input Source
&lt;/h2&gt;

&lt;p&gt;What is a stateful source? It is defined by subclassing &lt;a href="https://bytewax.io/apidocs/bytewax.inputs#bytewax.inputs.StatefulSource"&gt;&lt;code&gt;bytewax.inputs.StatefulSource&lt;/code&gt;&lt;/a&gt;. You can think about it as a "snapshot-able Python iterator": something that produces a stream of items via &lt;code&gt;StatefulSource.next&lt;/code&gt;, and also lets the Bytewax runtime ask for a snapshot of the position of the source via &lt;code&gt;StatefulSource.snapshot&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;_KafkaSource&lt;/code&gt; is going to read items from a specific Kafka topic's partition. Let's define that class and have a constructor that takes in all the details to start reading that partition: the consumer (already configured to connect to the correct Kafka cluster), the topic, the specific partition index, the default starting offset (beginning or end of the topic), and again we'll ignore the resume state for just another moment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class _KafkaSource(StatefulSource):
    def __init__(self, consumer, topic, part_idx, starting_offset, resume_state):
        self._offset = resume_state or starting_offset
        # Assign does not activate consumer grouping.
        consumer.assign([TopicPartition(topic, part_idx, self._offset)])
        self._consumer = consumer
        self._topic = topic

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The beating heart of the input source is the &lt;code&gt;StatefulSource.next&lt;/code&gt; method. It is periodically called by Bytewax and behaves similarly to a &lt;a href="https://docs.python.org/3/library/stdtypes.html#iterator.__next__"&gt;built-in Python iterator's &lt;code&gt;__next__&lt;/code&gt; method&lt;/a&gt;. It must do one of three things: return a new item to send into the dataflow, return &lt;code&gt;None&lt;/code&gt; to signal that there is no data currently but there might be later, or raise &lt;code&gt;StopIteration&lt;/code&gt; when the partition is complete.&lt;/p&gt;
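As a toy illustration of that contract (a stand-in using plain Python, not the real Bytewax API), here is a snapshot-able source over a finite in-memory list:

```python
class ListSource:
    """Toy stand-in for the StatefulSource contract over a finite list."""

    def __init__(self, items, resume_state=None):
        # Resume from the snapshotted position, or start at the beginning.
        self._idx = 0 if resume_state is None else resume_state
        self._items = items

    def next(self):
        if self._idx >= len(self._items):
            raise StopIteration()  # partition complete
        item = self._items[self._idx]
        self._idx += 1
        # A real source would instead return None when no data is ready yet.
        return item

    def snapshot(self):
        return self._idx  # position of the next item to read


src = ListSource(["a", "b", "c"])
assert src.next() == "a"
resumed = ListSource(["a", "b", "c"], resume_state=src.snapshot())
assert resumed.next() == "b"  # resumes exactly where the snapshot left off
```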

&lt;p&gt;&lt;code&gt;Consumer.poll&lt;/code&gt; gives us a way to ask if there are any new messages on the partition we set this consumer up to follow. If there are, we unpack the data message and return it; otherwise we handle the no-data case, the end-of-stream case, or an exceptional error case.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class _KafkaSource(StatefulSource):
    def next(self):
        msg = self._consumer.poll(0.001) # seconds
        if msg is None:
            return
        elif msg.error() is not None:
            if msg.error().code() == KafkaError._PARTITION_EOF:
                raise StopIteration()
            else:
                raise RuntimeError(
                    f&amp;amp;quot;error consuming from Kafka topic `{self._topic!r}`: {msg.error()}&amp;amp;quot;
                )
        else:
            item = (msg.key(), msg.value())
            # Resume reading from the next message, not this one.
            self._offset = msg.offset() + 1
            return item

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An important thing to note here is that &lt;code&gt;StatefulSource.next&lt;/code&gt; must never block. The Bytewax runtime employs a sort of cooperative multitasking, so each operator must return quickly, even if it has nothing to do, so that other operators in the dataflow that do have work can run. Unfortunately, there is currently no way in the Bytewax API to prevent polling of input sources (input comes from outside the dataflow, so Bytewax has no way of knowing when more data is available and must constantly check). The best practice is to pause briefly when there is no data, to prevent a full spin-loop, but not so long that you block other operators from doing their work.&lt;/p&gt;
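That pattern can be sketched generically. Here `poll_fn` is a hypothetical stand-in for something like `Consumer.poll` called with a small timeout:

```python
import time


def poll_once(poll_fn, empty_sleep=0.001):
    # Ask for one item. On no data, pause briefly (a bounded wait, never
    # an indefinite block) so we neither spin-loop nor starve the other
    # operators in the dataflow.
    item = poll_fn()
    if item is None:
        time.sleep(empty_sleep)
    return item


assert poll_once(lambda: "msg") == "msg"
assert poll_once(lambda: None) is None  # no data yet; try again later
```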

&lt;p&gt;There is also a &lt;code&gt;StatefulSource.close&lt;/code&gt; method which lets you perform a well-behaved shutdown when EOF is reached. It is not guaranteed to be called in a failure situation, so the connected system must not depend on it. In this case, &lt;code&gt;Consumer.close&lt;/code&gt; handles the graceful shutdown.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# class _KafkaSource(StatefulSource):
    def close(self):
        self._consumer.close()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resume State
&lt;/h3&gt;

&lt;p&gt;Let's explain how failure recovery works for input connectors. Bytewax's recovery system allows the dataflow to quickly resume processing and output without needing to replay all input. It does this by periodically snapshotting all internal state, input positions, and output positions of the dataflow. When it needs to recover after a failure, it loads all state from a recent snapshot, then re-plays input items in the same order from the instant of that snapshot, overwriting output items. This causes the state and output of the dataflow to evolve in the same way during the resumed execution as during the previous one.&lt;/p&gt;

&lt;h4&gt;
  
  
  Snapshotting
&lt;/h4&gt;

&lt;p&gt;So, we need to keep track of the current position somewhere in each partition. Kafka has the concept of message offsets: an immutable, incrementing integer that marks the position of each message. In &lt;code&gt;_KafkaSource.next&lt;/code&gt;, we kept track of the offset of the next message that the partition will read via &lt;code&gt;self._offset = msg.offset() + 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Bytewax calls &lt;code&gt;StatefulSource.snapshot&lt;/code&gt; when it needs to record that partition's position; our implementation returns the internally stored next-message offset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class _KafkaSource(StatefulSource):
    def snapshot(self):
        return self._offset

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Resume
&lt;/h4&gt;

&lt;p&gt;On resume after a failure, Bytewax's recovery machinery does the hard work of collecting all the snapshots, finding the ones that represent a coherent set of states across the previous execution's cluster, and threading each bit of snapshot data back through into &lt;code&gt;PartitionedInput.build_part&lt;/code&gt; for the same partition. To properly take advantage of that, your resulting partition must resume reading from the same spot represented by that snapshot.&lt;/p&gt;

&lt;p&gt;Since we were storing the Kafka message offset of the next message to be read in &lt;code&gt;_KafkaSource._offset&lt;/code&gt;, we need to thread that offset back into the &lt;code&gt;Consumer&lt;/code&gt; when it is built. That happens by passing &lt;code&gt;resume_state&lt;/code&gt; into the &lt;code&gt;_KafkaSource&lt;/code&gt; constructor, which assigns the consumer to start reading from that offset. Looking at that code again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Continued
# class _KafkaSource(StatefulSource):
# def __init__ (self, consumer, topic, part_idx, starting_offset, resume_state):
        self._offset = resume_state or starting_offset
        # Assign does not activate consumer grouping.
        consumer.assign([TopicPartition(topic, part_idx, self._offset)])
        ...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As one extra wrinkle, if there is no resume state for this partition because the partition is being built for the first time, &lt;code&gt;None&lt;/code&gt; will be passed for &lt;code&gt;resume_state&lt;/code&gt; in &lt;code&gt;PartitionedInput.build_part&lt;/code&gt;. In that case, we fall back to the requested "default starting offset": either the beginning or the end of the topic. When we do have resume state, we must ignore that default, since we need to start from the specific snapshotted offset to uphold the recovery contract.&lt;/p&gt;
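One subtlety worth flagging if you adapt this to your own connector: the `resume_state or starting_offset` idiom above is safe here only because this connector's snapshots are either a sentinel constant or `msg.offset() + 1`, never 0. If positions in your source can legitimately be 0, that idiom silently discards the resume state (0 is falsy); an explicit `None` check is safer. A sketch, with `OFFSET_BEGINNING` standing in for the `confluent_kafka` constant:

```python
OFFSET_BEGINNING = -2  # sentinel mimicking confluent_kafka's constant


def starting_position(resume_state, starting_offset=OFFSET_BEGINNING):
    # Explicit None check: a legitimate resume position of 0 is falsy,
    # so `resume_state or starting_offset` would wrongly discard it.
    return starting_offset if resume_state is None else resume_state


assert starting_position(None) == OFFSET_BEGINNING  # first build: default
assert starting_position(42) == 42                  # resume from snapshot
assert starting_position(0) == 0                    # `or` would return -2
```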

&lt;h2&gt;
  
  
  Delivery Guarantees
&lt;/h2&gt;

&lt;p&gt;Let's talk for a moment about how this recovery model with snapshots impacts delivery guarantees. A well-designed input connector on its own can only guarantee that the output of a dataflow to a downstream system is at-least-once: the recovery system ensures that we replay any input that might not have been output due to where the execution cluster failed. Achieving exactly-once processing requires coordination with the output connector (via something like transactions or two-phase commits) to ensure that the replay does not result in duplicated writes downstream.&lt;/p&gt;
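A toy model with ordinary Python lists (not Bytewax itself) shows why replay alone yields at-least-once rather than exactly-once delivery:

```python
def replay_from(items, snapshot_pos):
    # At-least-once replay: on resume, everything from the last snapshot
    # onward is re-read, so items written downstream between the snapshot
    # and the crash are delivered twice unless the output connector
    # deduplicates (e.g. via transactions or two-phase commits).
    return items[snapshot_pos:]


stream = ["a", "b", "c", "d"]
snapshot_pos = 2                   # the snapshot covered "a" and "b"
written_before_crash = stream[:3]  # "c" was written out, then we crashed
delivered = written_before_crash + replay_from(stream, snapshot_pos)
assert delivered == ["a", "b", "c", "c", "d"]  # "c" is duplicated
```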

&lt;h3&gt;
  
  
  Non-Replay-Able Sources
&lt;/h3&gt;

&lt;p&gt;If your input source does not have the ability to replay old data, you can still use it with Bytewax, but your delivery guarantees are limited to at-most-once. For example, when listening to an ephemeral SSE or WebSocket stream, you can always start listening, but often the request API does not give you the ability to replay missed events. When Bytewax attempts to resume, all the other operators will have their internal state returned to the last coherent snapshot, but since the input sources do not rewind, the dataflow will have missed all input between when that snapshot was taken and the resume.&lt;/p&gt;

&lt;p&gt;In this case, your &lt;code&gt;StatefulSource.snapshot&lt;/code&gt; can return &lt;code&gt;None&lt;/code&gt; and no recovery data will be saved. You can then ignore the &lt;code&gt;resume_state&lt;/code&gt; argument of &lt;code&gt;PartitionedInput.build_part&lt;/code&gt; because it will always be &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>connectors</category>
    </item>
    <item>
      <title>How We Detect Anomalies In Our AWS Infrastructure (And Have Peaceful Nights)</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Tue, 02 May 2023 18:50:54 +0000</pubDate>
      <link>https://dev.to/bytewax/how-we-detect-anomalies-in-our-aws-infrastructure-and-have-peaceful-nights-19k1</link>
      <guid>https://dev.to/bytewax/how-we-detect-anomalies-in-our-aws-infrastructure-and-have-peaceful-nights-19k1</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mE8HEAOX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d85qqusgtuorj5buicp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mE8HEAOX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d85qqusgtuorj5buicp0.png" alt="Post image" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Everyone using a cloud provider wants to monitor their systems and detect anomalies in usage. We run some internal data services, our website/blog, and a few demo clusters on AWS, and we wanted a low-maintenance way to monitor that infrastructure for issues, so we took the opportunity to dogfood Bytewax, of course :).&lt;/p&gt;

&lt;p&gt;In this blog post, we will walk you through the process of building a cloud-based anomaly detection system using Bytewax, Redpanda, and Amazon Web Services (AWS). Our goal is to create a dataflow that detects anomalies in EC2 instance CPU utilization. To achieve this, we will collect usage data from AWS CloudWatch using &lt;a href="https://www.elastic.co/logstash/"&gt;Logstash&lt;/a&gt; and store it using &lt;a href="https://redpanda.com/"&gt;Redpanda&lt;/a&gt;, a Kafka-compatible streaming data platform. Finally, we will use Bytewax, a Python stream processor, to build our anomaly detection system.&lt;/p&gt;

&lt;p&gt;This is exactly the same infrastructure we use internally at Bytewax and, in fact, we haven't touched it for months!&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Required Infrastructure on AWS
&lt;/h2&gt;

&lt;p&gt;Before we begin, ensure that you have the following prerequisites set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI configured with admin access&lt;/li&gt;
&lt;li&gt;Helm&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;A Kubernetes cluster running in AWS and kubectl configured to access it&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Configuring Kubernetes and Redpanda
&lt;/h3&gt;

&lt;p&gt;In this section, we will configure Kubernetes and Redpanda using the provided code snippets. Make sure you have a running Kubernetes cluster in AWS and kubectl configured to access it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set up a namespace
&lt;/h3&gt;

&lt;p&gt;Create a new namespace for Redpanda and set it as the active context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create ns redpanda-bytewax


kubectl config set-context --current --namespace=redpanda-bytewax

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Install Cert-Manager and Redpanda Operator
&lt;/h3&gt;

&lt;p&gt;The Redpanda operator requires cert-manager to create certificates for TLS communication. To install cert-manager with Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add jetstack https://charts.jetstack.io &amp;amp;amp;&amp;amp;amp; \
helm repo update &amp;amp;amp;&amp;amp;amp; \
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.4.4 \
  --set installCRDs=true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fetch the latest Redpanda Operator version, add the Redpanda Helm repo, and install the Redpanda Operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export VERSION=$(curl -s https://api.github.com/repos/redpanda-data/redpanda/releases/latest | jq -r .tag_name)


helm repo add redpanda https://charts.vectorized.io/ &amp;amp;amp;&amp;amp;amp; helm repo update


kubectl apply -k https://github.com/redpanda-data/redpanda/src/go/k8s/config/crd?ref=$VERSION


helm install redpanda-operator redpanda/redpanda-operator --namespace redpanda-system --create-namespace --version $VERSION

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Create Redpanda cluster
&lt;/h3&gt;

&lt;p&gt;Save the following YAML configuration in a file named &lt;code&gt;3_node_cluster.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: redpanda.vectorized.io/v1alpha1
kind: Cluster
metadata:
  name: three-node-cluster
spec:
  image: &amp;amp;quot;vectorized/redpanda&amp;amp;quot;
  version: &amp;amp;quot;latest&amp;amp;quot;
  replicas: 3
  resources:
    requests:
      cpu: 1
      memory: 1.2Gi
    limits:
      cpu: 1
      memory: 1.2Gi
  configuration:
    rpcServer:
      port: 33145
    kafkaApi:
    - port: 9092
    pandaproxyApi:
    - port: 8082
    schemaRegistry:
      port: 8081
    adminApi:
    - port: 9644
    developerMode: true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the Redpanda cluster configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f ./3_node_cluster.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the status of Redpanda pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get po -lapp.kubernetes.io/component=redpanda

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export the broker addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export BROKERS=`kubectl get clusters three-node-cluster -o=jsonpath=&amp;amp;apos;{.status.nodes.internal}&amp;amp;apos; | jq -r &amp;amp;apos;join(&amp;amp;quot;,&amp;amp;quot;)&amp;amp;apos;`

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Set up topics
&lt;/h3&gt;

&lt;p&gt;Run an rpk container to create and manage topics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run rpk-shell --rm -i --tty --image vectorized/redpanda --command /bin/bash

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the rpk terminal, export the broker addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export BROKERS=three-node-cluster-0.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-1.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-2.three-node-cluster.redpanda-bytewax.svc.cluster.local.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;View the cluster information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS cluster info

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create two topics with 5 partitions each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic create ec2_metrics -p 5


rpk --brokers $BROKERS topic create ec2_metrics_anomalies -p 5

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List the topics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic list

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consume messages from the &lt;code&gt;ec2_metrics&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic consume ec2_metrics -o start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Exporting CloudWatch EC2 Metrics to our Redpanda Cluster with Logstash
&lt;/h2&gt;

&lt;p&gt;Logstash is an open-source data processing pipeline that can ingest data from multiple sources, transform it, and send it to various destinations, such as Redpanda. In this case, we'll use Logstash to collect EC2 metrics from CloudWatch and send them to our Redpanda cluster for further processing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Logstash Permissions
&lt;/h4&gt;

&lt;p&gt;First, we need to create an AWS policy and user with the required permissions for Logstash to access CloudWatch and EC2. Save the following JSON configuration in a file named &lt;code&gt;cloudwatch-logstash-policy.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    &amp;amp;quot;Version&amp;amp;quot;: &amp;amp;quot;2012-10-17&amp;amp;quot;,
    &amp;amp;quot;Statement&amp;amp;quot;: [
        {
            &amp;amp;quot;Sid&amp;amp;quot;: &amp;amp;quot;Stmt1444715676000&amp;amp;quot;,
            &amp;amp;quot;Effect&amp;amp;quot;: &amp;amp;quot;Allow&amp;amp;quot;,
            &amp;amp;quot;Action&amp;amp;quot;: [
                &amp;amp;quot;cloudwatch:GetMetricStatistics&amp;amp;quot;,
                &amp;amp;quot;cloudwatch:ListMetrics&amp;amp;quot;
            ],
            &amp;amp;quot;Resource&amp;amp;quot;: &amp;amp;quot;*&amp;amp;quot;
        },
        {
            &amp;amp;quot;Sid&amp;amp;quot;: &amp;amp;quot;Stmt1444716576170&amp;amp;quot;,
            &amp;amp;quot;Effect&amp;amp;quot;: &amp;amp;quot;Allow&amp;amp;quot;,
            &amp;amp;quot;Action&amp;amp;quot;: [
                &amp;amp;quot;ec2:DescribeInstances&amp;amp;quot;
            ],
            &amp;amp;quot;Resource&amp;amp;quot;: &amp;amp;quot;*&amp;amp;quot;
        }
    ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can create the policy and user, and attach the policy to the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws iam create-policy --policy-name CloudwatchLogstash --policy-document file://cloudwatch-logstash-policy.json
aws iam create-user --user-name logstash-user


export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query &amp;amp;quot;Account&amp;amp;quot; --output text)


aws iam attach-user-policy --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/CloudwatchLogstash --user-name logstash-user

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To provide access, we can create Kubernetes secrets for the AWS access key and secret access key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create secret generic aws-secret-access-key --from-literal=value=$(aws iam create-access-key --user-name logstash-user | jq -r .AccessKey.SecretAccessKey)


kubectl create secret generic aws-access-key-id --from-literal=value=$(aws iam list-access-keys --user-name logstash-user --query &amp;amp;quot;AccessKeyMetadata[0].AccessKeyId&amp;amp;quot; --output text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can create an Amazon Elastic Container Registry (ECR) repository to store the custom Logstash image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ecr create-repository --repository-name redpanda-bytewax


export REPOSITORY_URI=$(aws ecr describe-repositories --repository-names redpanda-bytewax --output text --query &amp;amp;quot;repositories[0].repositoryUri&amp;amp;quot;)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we create a Logstash image with the CloudWatch input plugin installed. Create a Dockerfile named &lt;code&gt;logstash-Dockerfile&lt;/code&gt; that installs the plugin in a &lt;code&gt;RUN&lt;/code&gt; step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM docker.elastic.co/logstash/logstash:7.17.3
RUN bin/logstash-plugin install logstash-input-cloudwatch

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we build and push the Logstash image to the ECR repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -f logstash-Dockerfile -t $REPOSITORY_URI:\logstash-cloudwatch .


export AWS_REGION=us-west-2


aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com


docker push $REPOSITORY_URI:logstash-cloudwatch

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy Logstash on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Now that we have our custom Logstash image, we will deploy it on Kubernetes using the Helm chart provided by Elastic. First, we need to gather some information and create a &lt;code&gt;logstash-values.yaml&lt;/code&gt; file with the necessary configuration.&lt;/p&gt;

&lt;p&gt;Run the following commands to obtain the required information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo $REPOSITORY_URI


echo $AWS_REGION


echo $BROKERS | sed -e &amp;amp;apos;s/local\./local\:9092/g&amp;amp;apos;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a &lt;code&gt;logstash-values.yaml&lt;/code&gt; file and replace the placeholders (shown with &lt;code&gt;&amp;amp;lt;&amp;amp;gt;&lt;/code&gt;) with the information obtained above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;image: &amp;amp;quot;&amp;amp;lt;YOUR REPOSITORY URI&amp;amp;gt;&amp;amp;quot;
imageTag: &amp;amp;quot;logstash-cloudwatch&amp;amp;quot;
imagePullPolicy: &amp;amp;quot;Always&amp;amp;quot;

persistence:
  enabled: true

logstashConfig:
  logstash.yml: |
    http.host: 0.0.0.0
    xpack.monitoring.enabled: false

logstashPipeline:
  uptime.conf: |
    input {
      cloudwatch {
        namespace =&amp;amp;gt; &amp;amp;quot;AWS/EC2&amp;amp;quot;
        metrics =&amp;amp;gt; [&amp;amp;quot;CPUUtilization&amp;amp;quot;]
        region =&amp;amp;gt; &amp;amp;quot;&amp;amp;lt;YOUR AWS REGION&amp;amp;gt;&amp;amp;quot;
        interval =&amp;amp;gt; 300
        period =&amp;amp;gt; 300
      }       
    }
    filter {
      mutate {
        add_field =&amp;amp;gt; {
          &amp;amp;quot;[index]&amp;amp;quot; =&amp;amp;gt; &amp;amp;quot;0&amp;amp;quot;
          &amp;amp;quot;[value]&amp;amp;quot; =&amp;amp;gt; &amp;amp;quot;%{maximum}&amp;amp;quot;
          &amp;amp;quot;[instance]&amp;amp;quot; =&amp;amp;gt; &amp;amp;quot;%{InstanceId}&amp;amp;quot;                      
        }
      }
    }
    output {
        kafka {
          bootstrap_servers =&amp;amp;gt; &amp;amp;quot;&amp;amp;lt;YOUR REDPANDA BROKERS&amp;amp;gt;&amp;amp;quot;
          topic_id =&amp;amp;gt; &amp;amp;apos;ec2_metrics&amp;amp;apos;
          codec =&amp;amp;gt; json
        }
    }

extraEnvs:
  - name: &amp;amp;apos;AWS_ACCESS_KEY_ID&amp;amp;apos;
    valueFrom:
      secretKeyRef:
        name: aws-access-key-id
        key: value
  - name: &amp;amp;apos;AWS_SECRET_ACCESS_KEY&amp;amp;apos;
    valueFrom:
      secretKeyRef:
        name: aws-secret-access-key
        key: value

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;logstash-values.yaml&lt;/code&gt; file ready, install the Logstash Helm chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install logstash elastic/logstash -f logstash-values.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To verify that Logstash is exporting the EC2 metrics to the Redpanda cluster, open a terminal with rpk and consume the &lt;code&gt;ec2_metrics&lt;/code&gt; topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic consume ec2_metrics -o start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;CTRL-C&lt;/code&gt; to quit the rpk terminal when you're done.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Dataflow to Detect Anomalies with Bytewax
&lt;/h2&gt;

&lt;p&gt;With our infrastructure in place, it's time to build a dataflow to detect anomalies. We will use Bytewax and &lt;a href="https://www.bytewax.io/docs/deployment/waxctl"&gt;Waxctl&lt;/a&gt; to define and deploy a dataflow that processes the EC2 instance CPU utilization data stored in the Redpanda cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anomaly Detection with Half Space Trees
&lt;/h3&gt;

&lt;p&gt;Half Space Trees (HST) is an unsupervised machine learning algorithm used for detecting anomalies in streaming data. The algorithm is designed to efficiently handle high-dimensional and high-velocity data streams. HST builds a set of binary trees to partition the feature space into half spaces, where each tree captures a different view of the data. By observing the frequency of points falling into each half space, the algorithm can identify regions that are less dense than others, suggesting that data points within those regions are potential anomalies.&lt;/p&gt;

&lt;p&gt;In our case, we will use HST to detect anomalous CPU usage in EC2 metrics. We'll leverage the Python library River, which provides an implementation of the HST algorithm, and Bytewax, a platform for creating data processing pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building the Dataflow for Anomaly Detection
&lt;/h3&gt;

&lt;p&gt;To create our dataflow, we'll first import the necessary libraries and set up Kafka connections. The following code snippet demonstrates how to create a dataflow with River and Bytewax to consume EC2 metrics from Kafka and detect anomalous CPU usage using HST:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os
import datetime as dt
from pathlib import Path

from bytewax.connectors.kafka import KafkaInput, KafkaOutput
from bytewax.dataflow import Dataflow
from bytewax.recovery import SqliteRecoveryConfig

from river import anomaly

kafka_servers = os.getenv(&amp;amp;quot;BYTEWAX_KAFKA_SERVER&amp;amp;quot;, &amp;amp;quot;localhost:9092&amp;amp;quot;)
kafka_topic = os.getenv(&amp;amp;quot;BYTEWAX_KAFKA_TOPIC&amp;amp;quot;, &amp;amp;quot;ec2_metrics&amp;amp;quot;)
kafka_output_topic = os.getenv(&amp;amp;quot;BYTEWAX_KAFKA_OUTPUT_TOPIC&amp;amp;quot;, &amp;amp;quot;ec2_metrics_anomalies&amp;amp;quot;)

# Define the dataflow object and kafka input.
flow = Dataflow()
flow.input(&amp;amp;quot;inp&amp;amp;quot;, KafkaInput(kafka_servers.split(&amp;amp;quot;,&amp;amp;quot;), [kafka_topic]))

# convert to percentages and group by instance id
def group_instance_and_normalize(key__data):
  _, data = key__data
  data = json.loads(data)
  data[&amp;amp;quot;value&amp;amp;quot;] = float(data[&amp;amp;quot;value&amp;amp;quot;]) / 100
  return data[&amp;amp;quot;instance&amp;amp;quot;], data

flow.map(group_instance_and_normalize)
# (&amp;amp;quot;c6585a&amp;amp;quot;, {&amp;amp;quot;index&amp;amp;quot;: &amp;amp;quot;1&amp;amp;quot;, &amp;amp;quot;value&amp;amp;quot;: &amp;amp;quot;0.11&amp;amp;quot;, &amp;amp;quot;instance&amp;amp;quot;: &amp;amp;quot;c6585a&amp;amp;quot;})

# Stateful operator for anomaly detection
class AnomalyDetector(anomaly.HalfSpaceTrees):

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our anomaly detector inherits from the &lt;code&gt;HalfSpaceTrees&lt;/code&gt; class in the River package, which accepts the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;n_trees&lt;/code&gt; – defaults to 10&lt;/li&gt;
&lt;li&gt;&lt;code&gt;height&lt;/code&gt; – defaults to 8&lt;/li&gt;
&lt;li&gt;&lt;code&gt;window_size&lt;/code&gt; – defaults to 250&lt;/li&gt;
&lt;li&gt;&lt;code&gt;limits&lt;/code&gt; (Dict[Hashable, Tuple[float, float]]) – defaults to None&lt;/li&gt;
&lt;li&gt;&lt;code&gt;seed&lt;/code&gt; (int) – defaults to None&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
  def __init__(self, *args, **kwargs):
      super().__init__(*args, n_trees=5, height=3, window_size=5, seed=42, **kwargs)

  def update(self, data):
      self.learn_one({&amp;amp;quot;value&amp;amp;quot;: data[&amp;amp;quot;value&amp;amp;quot;]})
      data[&amp;amp;quot;score&amp;amp;quot;] = self.score_one({&amp;amp;quot;value&amp;amp;quot;: data[&amp;amp;quot;value&amp;amp;quot;]})
      if data[&amp;amp;quot;score&amp;amp;quot;] &amp;amp;gt; 0.7:
          data[&amp;amp;quot;anom&amp;amp;quot;] = 1
      else:
          data[&amp;amp;quot;anom&amp;amp;quot;] = 0
      return self, (
          data[&amp;amp;quot;index&amp;amp;quot;],
          data[&amp;amp;quot;timestamp&amp;amp;quot;],
          data[&amp;amp;quot;value&amp;amp;quot;],
          data[&amp;amp;quot;score&amp;amp;quot;],
          data[&amp;amp;quot;anom&amp;amp;quot;],
      )

flow.stateful_map(&amp;amp;quot;detector&amp;amp;quot;, lambda: AnomalyDetector(), AnomalyDetector.update)
# (&amp;amp;quot;c6585a&amp;amp;quot;, (index, timestamp, value, score, anom))

# filter out non-anomalous values
flow.filter(lambda x: bool(x[1][4]))

flow.map(lambda x: (x[0], json.dumps(x[1])))
flow.output(&amp;amp;quot;output&amp;amp;quot;, KafkaOutput(kafka_servers.split(&amp;amp;quot;,&amp;amp;quot;), kafka_output_topic))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this dataflow, we first read data from Kafka and deserialize the JSON message. We then normalize the CPU usage values and group them by the instance ID. Next, we apply the AnomalyDetector class inside a stateful operator, which calculates the anomaly score for each data point using HST. We set a threshold for the anomaly score (0.7 in this example) and mark data points as anomalous if their scores exceed the threshold. Finally, we filter out non-anomalous values and output the anomalous data points to a separate Kafka topic.&lt;/p&gt;

&lt;p&gt;Using this dataflow, we can continuously monitor EC2 metrics and detect anomalous CPU usage, helping us identify potential issues in our infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Dataflow Docker Image
&lt;/h2&gt;

&lt;p&gt;Create a Dockerfile named &lt;code&gt;dataflow-Dockerfile&lt;/code&gt; with the dataflow's Python dependencies, then build and push the image to the ECR repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM bytewax/bytewax:0.16.0-python3.9
RUN /venv/bin/pip install river==0.10.1 pandas confluent-kafka


docker build -f dataflow-Dockerfile -t $REPOSITORY_URI:dataflow .


docker push $REPOSITORY_URI:dataflow

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploying the Dataflow
&lt;/h2&gt;

&lt;p&gt;To deploy the dataflow, we'll use the Bytewax command-line tool, waxctl. There are two options for deploying the dataflow, depending on how you have set up your Kafka server environment variable. When we deploy our dataflow, we will set the number of processes (denoted by &lt;code&gt;-p&lt;/code&gt;) to 5 to match the number of partitions we set when we initially created our Redpanda topic.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: Generate waxctl command
&lt;/h4&gt;

&lt;p&gt;Use the following command to generate the waxctl command with the appropriate environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo&amp;amp;quot;
waxctl df deploy ./dataflow.py \\
  --name ec2-cpu-ad \\
  -p 5 \\
  -i $REPOSITORY_URI \\
  -t dataflow \\
  -e &amp;amp;apos;\&amp;amp;quot;BYTEWAX_KAFKA_SERVER=$BROKERS\&amp;amp;quot;&amp;amp;apos; \\
  -e BYTEWAX_KAFKA_TOPIC_GROUP_ID=dataflow_group \\
  --debug
&amp;amp;quot;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will output the waxctl command with the correct Kafka server values. Copy the output and run it to deploy the dataflow.&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 2: Hardcoded BYTEWAX_KAFKA_SERVER value
&lt;/h4&gt;

&lt;p&gt;If you prefer to hardcode the Kafka server values, use the following command to deploy the dataflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;waxctl df deploy ./dataflow.py \
  --name ec2-cpu-ad \
  -p 5 \
  -i $REPOSITORY_URI \
  -t dataflow \
  -e &amp;amp;apos;&amp;amp;quot;BYTEWAX_KAFKA_SERVER=three-node-cluster-0.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-1.three-node-cluster.redpanda-bytewax.svc.cluster.local.,three-node-cluster-2.three-node-cluster.redpanda-bytewax.svc.cluster.local.&amp;amp;quot;&amp;amp;apos; \
  -e BYTEWAX_KAFKA_TOPIC_GROUP_ID=dataflow_group \
  --debug

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the dataflow has been running for a while, you can consume from the anomalies topic to see any detected anomalies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rpk --brokers $BROKERS topic consume ec2_metrics_anomalies -o start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a next step, you could deploy a dataflow to consume from the anomalies topic and alert you in Slack! Or add &lt;a href="https://github.com/rerun-io/rerun"&gt;rerun&lt;/a&gt;, as we demonstrated in the previous blog post, to visualize the anomalies.&lt;/p&gt;
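&lt;p&gt;As a hedged sketch of that next step, the snippet below formats an anomaly record (here modeled as a small dict; adapt the field access to however you serialize the dataflow's output) into a Slack message and posts it to an incoming webhook. The &lt;code&gt;SLACK_WEBHOOK_URL&lt;/code&gt; environment variable is a placeholder you would create in your own Slack workspace:&lt;/p&gt;

```python
# Sketch: turn records from the ec2_metrics_anomalies topic into Slack
# alerts. The field names follow the dataflow above; SLACK_WEBHOOK_URL is
# a hypothetical env var pointing at your Slack incoming webhook.
import json
import os
import urllib.request


def format_alert(record: dict) -> str:
    """Build a human-readable alert from an anomaly record."""
    return (
        f":rotating_light: Anomalous CPU usage on {record['instance']}: "
        f"value={record['value']:.2f}, score={record['score']:.2f}"
    )


def post_to_slack(text: str) -> None:
    """POST the message to the webhook; no-op if the URL is not set."""
    url = os.getenv("SLACK_WEBHOOK_URL")
    if not url:
        return
    body = json.dumps({"text": text}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)


# Example record shaped like the dataflow's output fields:
alert = format_alert({"instance": "i-0abc123", "value": 0.95, "score": 0.82})
post_to_slack(alert)
```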

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we have demonstrated how to set up a system for monitoring EC2 metrics and detecting anomalous CPU usage. By leveraging tools like Logstash, &lt;a href="https://redpanda.com/"&gt;Redpanda&lt;/a&gt;, &lt;a href="https://riverml.xyz/0.15.0/"&gt;River&lt;/a&gt;, and Bytewax, we've created a robust and scalable pipeline for processing and analyzing streaming data.&lt;/p&gt;

&lt;p&gt;This system provides a range of benefits, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Efficiently processing high-dimensional and high-velocity data streams&lt;/li&gt;
&lt;li&gt;Using the Half Space Trees unsupervised machine learning algorithm for detecting anomalies in streaming data&lt;/li&gt;
&lt;li&gt;Continuously monitoring EC2 metrics and identifying potential issues in the infrastructure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this setup, you can effectively monitor your EC2 instances and ensure that your infrastructure is running smoothly, helping you proactively address any issues that may arise.&lt;/p&gt;

&lt;p&gt;That's it! You now have a working cloud-based anomaly detection system using &lt;a href="https://bytewax.io/"&gt;Bytewax&lt;/a&gt;, &lt;a href="https://redpanda.com/"&gt;Redpanda&lt;/a&gt;, and AWS. Feel free to adapt this setup to your specific use case and explore the various features and capabilities offered by these tools.&lt;/p&gt;

</description>
      <category>anomalydetection</category>
      <category>aws</category>
      <category>redpanda</category>
    </item>
    <item>
      <title>Data Council: The Highlights of Day 2</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Sun, 26 Mar 2023 00:45:10 +0000</pubDate>
      <link>https://dev.to/bytewax/data-council-the-highlights-of-day-2-183e</link>
      <guid>https://dev.to/bytewax/data-council-the-highlights-of-day-2-183e</guid>
      <description>&lt;p&gt;Welcome back, data enthusiasts! I'm excited to dive into the second installment of my blog series covering the extraordinary Data Council Conference. If you haven't already, be sure to check out &lt;a href="https://dev.to/bytewax/data-council-the-highlights-of-day-1-493h"&gt;my first post&lt;/a&gt;, which provided a comprehensive overview of the engaging talks and workshops from Day 1.&lt;/p&gt;

&lt;p&gt;On Day 2, before sessions, we are organizing an informal #StreamBrew coffee gathering for early birds at 7:15 am at KesosTacos near the conference venue. RSVP &lt;a href="https://bitly.com/m/bytewax"&gt;here&lt;/a&gt;. I hope to mingle, network, and enjoy some scrumptious breakfast migas alongside morning coffee. If you've never had migas, don't worry - I haven't either - you won't experiment alone! &lt;/p&gt;

&lt;h2&gt;
  
  
  Panels
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AI Panel
&lt;/h3&gt;

&lt;p&gt;One of the most highly anticipated events on Day 2 of the Data Council Conference is the AI Panel. Though details about the panel's specific focus remain under wraps, the excitement is palpable. I expect a riveting discussion featuring top-tier experts, who will undoubtedly share their unique perspectives on artificial intelligence's current state and future directions. AI is changing the world we live in - it seems to happen almost every week, every month, for sure!&lt;/p&gt;

&lt;h3&gt;
  
  
  How Investors Think About Data
&lt;/h3&gt;

&lt;p&gt;Another must-attend event on Day 2 is the panel titled "How Investors Think About Data," featuring an impressive lineup of investment professionals. Gain valuable insights from Lauren Reeder, Partner at Sequoia Capital; Slater Stich, Partner at Bain Capital Ventures; Leigh Marie Braswell, Principal at Founders Fund; and Pete Soderling, Founder of Data Community Fund.&lt;/p&gt;

&lt;p&gt;I work for a data-oriented startup. And given the current state of the economy, including the infamous SVB disaster, I am curious about what fundraising will look like in the mid-long term and how to maximize our chances to succeed. Also, Pete is the founder and chair of the Data Council conference, and I am eager to hear from him too!&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks
&lt;/h2&gt;

&lt;p&gt;Day 2 of the Data Council Conference offers three tracks; the full schedule is &lt;a href="https://docs.google.com/document/d/1T3dtBXeEyrujeg-5H8L5ncWKGuq3vMFYMyjriXWMAAI/edit"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first track, "Applied &amp;amp; Generative AI," covers topics such as Large Language/Transformer Models, generative AI, product-based implementations of new research methods, and exciting new features powered by machine learning inside products.&lt;/p&gt;

&lt;p&gt;The second track, "Analytics," focuses on the latest tools, techniques, and best practices for extracting valuable insights from data. You'll learn how top teams are solving their analytics challenges and discover the best new tools in the process.&lt;/p&gt;

&lt;p&gt;Finally, my favorite one, the "Data Culture &amp;amp; Community" track. It emphasizes fostering a vibrant data ecosystem and promoting collaboration among data professionals. Sessions in this track will highlight the role of community building, open-source projects, and knowledge sharing in advancing data science and data engineering. &lt;/p&gt;

&lt;p&gt;In case you're torn between multiple sessions like me, remember that many of the presentations will be recorded and made available for viewing later. With that in mind, I will highlight only a fraction of what sparks my interest.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/generative-ai-for-product-builders?hsLang=en"&gt;Tristan Zajonc - Generative AI for Product Builders &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;I always considered no-code or low-code solutions an excellent option for a non-technical (and technical, too, in some cases) founder to build a prototype and get their MVP out there as soon as possible without hiring a bunch of developers. DALL•E, MidJourney, and Stable Diffusion did a similar thing and unlocked creativity for the rest of us. In that light, Tristan's talk about the caveats and nuances of building products using generative AI is very well-timed and relevant. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/how-vercel-builds-dozens-of-metrics-from-one-heterogenous-table?hsLang=en"&gt;Thomas Mickley-Doyle "How Vercel Builds Dozens of Metrics from One Heterogenous Table"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;I remember quite a few blog posts about the importance of reacting quickly to changes - partly because Bytewax enables real-time ML, and partly because it's a hot topic. Thomas Mickley-Doyle from Vercel will share their innovative approach to data-driven decision-making. Vercel's strategy has increased stakeholder participation in analytics, reduced troubleshooting time for outlier events, and eliminated the data team as a bottleneck for data-related tasks. Sounds like a lot of fun!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/behind-the-curtain-what-it-takes-to-support-the-worlds-most-popular-open-source-communities?hsLang=en"&gt;Katrina Riehl "Behind the Curtain: What it Takes to Support the World's Most Popular Open Source Communities"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Dr. Katrina Riehl is President of the Board of Directors at NumFOCUS, Head of the Streamlit Data Team at Snowflake, and Adjunct Lecturer at Georgetown University. If you are building an OSS-driven business or care about how the community perceives your brand (and you better do :)), her talk is a must-go. NumFOCUS is operating on a vast scale: 50 sponsored projects and 60 affiliated projects, including some of the world's most popular open-source projects like NumPy, SciPy, Jupyter, and Pandas. There is definitely a ton to learn from NumFOCUS and Katrina.&lt;/p&gt;

&lt;p&gt;I can't wait to share more of the content from the conference itself! I expect no less than an unforgettable experience!&lt;/p&gt;

</description>
      <category>conference</category>
      <category>realtimeanalytics</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Council: The Highlights of Day 1</title>
      <dc:creator>Oli Makhasoeva</dc:creator>
      <pubDate>Thu, 23 Mar 2023 05:44:54 +0000</pubDate>
      <link>https://dev.to/bytewax/data-council-the-highlights-of-day-1-493h</link>
      <guid>https://dev.to/bytewax/data-council-the-highlights-of-day-1-493h</guid>
      <description>&lt;p&gt;The COVID-19 pandemic has profoundly impacted how we work and learn, and the conference industry is no exception. Many events have moved to virtual formats, allowing attendees to participate from the comfort of their own homes. I even built a business around it! And while I absolutely love virtual events and can talk about their advantages endlessly, there's an undeniable charm to in-person conferences, too.&lt;/p&gt;

&lt;p&gt;After &lt;em&gt;three years&lt;/em&gt; of remote work, I am thrilled to finally attend &lt;a href="https://www.datacouncil.ai/"&gt;the Data Council conference&lt;/a&gt; in person in Austin and connect with fellow tech enthusiasts face-to-face as soon as next week!&lt;/p&gt;

&lt;p&gt;The conference attracts diverse data professionals from various industries, and whilst I've been at events that featured data talks or data tracks, and even organized a virtual data-focused conference myself, this is the first time I'll have a chance to see so many professionals interested in the latest developments in data engineering, data science, machine learning, and AI.&lt;/p&gt;

&lt;p&gt;Come say hi 👋 I'm also bringing &lt;a href="https://bytewax.io/"&gt;Bytewax's&lt;/a&gt; swag that you don't want to miss, so &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;let's keep in touch&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Today I want to share some of the sessions that I found particularly exciting and would like to attend.&lt;/p&gt;

&lt;p&gt;I have to split this post because it's too much to cover in one shot; you are reading about Day 1, March 28th.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;a href="https://docs.google.com/document/d/1T3dtBXeEyrujeg-5H8L5ncWKGuq3vMFYMyjriXWMAAI/edit"&gt;Agenda&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The conference features an action-packed schedule across three days, including regular and lightning talks, workshops, and even speaker office hours.  The latter is especially helpful for newcomers to the community (like me), facilitating connections with experts.&lt;/p&gt;

&lt;p&gt;Beyond the formal sessions, the conference also offers plenty of opportunities for informal networking (see &lt;a href="https://twitter.com/DataCouncilAI/status/1630994017679802371?s=20"&gt;this thread&lt;/a&gt;). We (Bytewax) are organizing &lt;a href="https://bit.ly/3YszNvd?r=lp"&gt;#StreamBrew coffee&lt;/a&gt; on March 29th in the morning (7:15 AM) and &lt;a href="https://bit.ly/3ZGzRsw?r=lp"&gt;#StreamBrew Beer&lt;/a&gt; in the evening on March 30th.&lt;/p&gt;

&lt;p&gt;No wonder that, with so much to offer, this conference is a must-attend event for data folks!&lt;/p&gt;

&lt;h2&gt;
  
  
  Keynotes
&lt;/h2&gt;

&lt;p&gt;As I said before, the conference's schedule is crowded, and keynotes are no exception - two on each day!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/building-a-control-plane-for-data?hsLang=en"&gt;Shirshanka Das "Building a Control Plane for Data"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The conference kicks off with an exciting keynote by &lt;a href="https://www.linkedin.com/in/shirshankadas/"&gt;Shirshanka Das&lt;/a&gt;. Shirshanka is a co-founder and CEO of Acryl Data. He will discuss the control plane for data, a harmonizing layer powered by metadata that unifies data discovery, observability, quality, governance, and management. He will describe the fundamental characteristics of a control plane and explain the use cases that can be accomplished with a unified control plane.&lt;/p&gt;

&lt;p&gt;I am obsessed with unification and simplification. It brings order and enables teams to work more effectively. Thrilled to hear Shirshanka's thoughts on how to do that for data stacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/big-data-is-dead?hsLang=en"&gt;Jordan Tigani "Big Data is Dead" &lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Next up is &lt;a href="https://twitter.com/jrdntgn"&gt;Jordan Tigani&lt;/a&gt; of MotherDuck with an intriguing title, "Big Data is Dead." The conference's website didn't have a description of the talk at the time I was writing this, but I googled and found &lt;a href="https://motherduck.com/blog/big-data-is-dead/"&gt;a fresh blog post&lt;/a&gt; by Jordan. &lt;br&gt;
I have to admit, I was a little skeptical about the title as it sounds like clickbait (unrelated, but I have a background in Scala, and Scala is dead forever and dies every year again and again, so it's not news). &lt;/p&gt;

&lt;p&gt;Nonetheless, Jordan is exceptionally qualified to talk about this topic: he shares graphs based on query logs, deal post-mortems, benchmark results, customer support tickets, customer conversations, service logs, and published blog posts. He has his points, and I won't post spoilers by citing his blog post. Besides, I am sure he has more to share in his keynote.&lt;/p&gt;

&lt;h2&gt;
  
  
  Talks
&lt;/h2&gt;

&lt;p&gt;There are three tracks on Day 1:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Engineering &amp;amp; Infra&lt;/li&gt;
&lt;li&gt;Data Science &amp;amp; Algos&lt;/li&gt;
&lt;li&gt;ML Ops &amp;amp; Platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is challenging to choose what to highlight, and I might overlook or forget some talks, so if your favorite one is not on the list, please feel free to let me know on &lt;a href="https://join.slack.com/t/bytewaxcommunity/shared_invite/zt-1lhq9bxbr-T3CXxR_9RIUGb4qcBK26Qw"&gt;our Slack&lt;/a&gt;, or tag us on &lt;a href="https://twitter.com/bytewax"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/company/bytewax"&gt;LinkedIn&lt;/a&gt;, &lt;a href="https://twitter.com/Oli_kitty"&gt;my DMs&lt;/a&gt; are open too.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/data-contracts-accountable-data-quality?hsLang=en"&gt;Chad Sanderson "Data Contracts: Accountable Data Quality."&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/chad-sanderson/"&gt;Chad Sanderson&lt;/a&gt; is the Founder of Data Quality Camp, and &lt;a href="https://join.slack.com/t/dataqualitycamp/shared_invite/zt-1rk5xsx5j-o3dnRa75iM1mY5~R9HWJMg"&gt;the Data Quality Camp's Slack&lt;/a&gt; is the friendliest place to be. The channels are active, members are helpful, and you can even shamelessly promote whatever you want in the #be-shameless :D&lt;/p&gt;

&lt;p&gt;If you're interested in the data contracts, then Chad's talk is definitely worth checking out. He recently &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7044381753561497600/"&gt;posted on his LinkedIn&lt;/a&gt;  that it's going to be the most in-depth presentation yet on how they implemented data contracts at scale at Convoy.&lt;/p&gt;

&lt;p&gt;You'll also want to attend Data Quality Camp's first-ever in-person happy hour on Monday the 27th at the Stay Put Brewery near the event venue.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/extinguishing-the-garbage-fire-of-ml-testing?hsLang=en"&gt;Emily Curtin "Extinguishing the Garbage Fire of ML Testing"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The abstract of &lt;a href="https://www.linkedin.com/in/emilymaycurtin"&gt;Emily Curtin&lt;/a&gt;'s (Staff MLOps Engineer at Intuit Mailchimp) talk resonates with me, I also think that testing should be at the heart and mind of people implementing complex systems. Emily is focusing on testing in MLOps and Data Science, which I need to familiarize myself with, and I look forward to learning about it from her.&lt;/p&gt;

&lt;p&gt;I also adore that she says in her bio that she gets paid to say "it depends" and "well actually."&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/how-to-interpret-and-explain-your-black-box-models?hsLang=en"&gt;Sophia Yang "How to Interpret &amp;amp; Explain Your Black-Box Models?"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/sophiamyang?trk=public_profile_browsemap"&gt;Sophia Yang&lt;/a&gt; is a Senior Data Scientist and a Developer Advocate at Anaconda. She is highly knowledgeable about technology and passionate about data science and Python open-source communities.&lt;br&gt;
I think we share many interests, so I'm not missing her talk in which she covers popular model explanation techniques such as explainable boosting machine, visual analytics, distillation, prototypes, saliency map, counterfactual, feature visualization, LIME, SHAP, interpretML, and TCAV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V53xU38b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qot8zk7yk5b0uke5dwbm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V53xU38b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qot8zk7yk5b0uke5dwbm.jpg" alt="Jules Damji at Data Love" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/huggingface-ray-air-integration-a-python-developers-guide-to-scaling-transformers?hsLang=en"&gt;Jules Damji &amp;amp; Antoni Baum "HuggingFace + Ray AIR Integration: A Python Developer's Guide to Scaling Transformers"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Last but not least, I want to highlight a talk by &lt;a href="https://twitter.com/2twitme"&gt;Jules Damji&lt;/a&gt;, who spoke at one of my events before (check out his handmade avatar from the pre-Midjourney era). Jules and Antoni will talk about Hugging Face Transformers and Ray AIR. It's cutting-edge machine learning, and I'm always eager to learn more about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workshops
&lt;/h2&gt;

&lt;p&gt;At Data Council, all workshops are included in the cost of your ticket, so I'll try to attend them too.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/urgent-help-these-pets-find-homes-working-across-teams-in-datahub?hsLang=en"&gt;Maggie Hays &amp;amp; Paul Logan "URGENT! Help these Pets Find Homes: Working Across Teams in DataHub"&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/maggie-hays/"&gt;Maggie&lt;/a&gt; and Paul's workshop is about Long Tail Companions, a hypothetical pet adoption service in crisis: its data infrastructure has ground to a halt, and it cannot process any adoptions. I care about pets, love fixing failures, and enjoy teamwork. All things combined, it sounds like an excellent session for me.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.datacouncil.ai/talks/how-to-make-marketing-fall-in-love-with-data-modeling?hsLang=en"&gt;Erik Edelmann &amp;amp; Meredith Adler "How to Make Marketing Fall In Love with Data Modeling&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Data Modeling applied to marketing is obviously something that I care about. I'm joining &lt;a href="https://www.linkedin.com/in/erik-edelmann-43247358"&gt;Erik&lt;/a&gt; and Meredith for a demo of the campaign they built at Hightouch. They will cover how the team modeled the data, validated the results, and created a reusable process to support future marketing campaigns.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎈Community party
&lt;/h2&gt;

&lt;p&gt;The day wraps up with a Community Party at 5:30 pm (kudos to Databand for supporting it).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RwyiABu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o17mfpl6gr7x1w5oa2or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RwyiABu3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o17mfpl6gr7x1w5oa2or.png" alt="Zander Matheson - getting real time" width="880" height="880"&gt;&lt;/a&gt;&lt;br&gt;
Don't forget to attend &lt;a href="https://www.datacouncil.ai/talks/getting-real-time-when-to-move-from-batch-to-streaming-and-how-to-do-it-without-hiring-an-entirely-new-team?hsLang=en"&gt;Zander's awesome talk&lt;/a&gt;; I'll be giving away some great swag there!&lt;/p&gt;

&lt;p&gt;Also, see you at #StreamBrew; RSVP &lt;a href="https://bitly.com/m/bytewax"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the next posts I'll cover the following days, so stay tuned!&lt;br&gt;
See you in Austin!&lt;/p&gt;

&lt;p&gt;UPD: &lt;a href="https://dev.to/bytewax/data-council-the-highlights-of-day-2-183e"&gt;Day 2&lt;/a&gt;&lt;/p&gt;

</description>
      <category>conference</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
