DEV Community: WarpStream

WarpStream Newsletter #5: Dealing with Rejection, Schema Validation, and Time Lag

WarpStream — Fri, 16 Aug 2024 14:12:10 +0000

Welcome to the fifth issue of the WarpStream newsletter where we share our latest blogs on:

Backpressure in distributed systems.
WarpStream schema validation.
Measuring consumer group lag in time instead of offsets.

As well as our most recent product updates. Connect with us on social media and other platforms to stay up to date via the links in the social footer at the bottom of this email.

New Blog Posts

Dealing with rejection (in distributed systems)

At its core, backpressure is a really simple concept. When the system is nearing overload, it should start “saying no” by slowing down or rejecting requests. Of course, the big question is: How do we know when we should reject a request? The latest version of the WarpStream BYOC Agent includes backpressure support for Produce and Fetch requests.

Announcing WarpStream schema validation

We are excited to announce that users can now connect WarpStream Agents to any Kafka-compatible Schema Registry and validate that records conform to the provided schema. WarpStream validates not only that the schema ID encoded in a given record matches the schema ID in the Schema Registry, but also that the record actually conforms with the provided schema (field level validation).

The Kafka metric you’re not using: stop counting messages, start measuring time

Monitoring consumer groups can be a challenge. We explain why the usual way of measuring consumer group lag (using offsets) isn’t always the best and show you an alternative approach (time lag) that makes it much easier to monitor and troubleshoot them. The best part? Time lag measurement is built directly into WarpStream — no third-party tooling is needed.

Recent Product Updates

In addition to our blog about schema validation, which we shared above, you can also check out our related doc pages on Agent schema validation and connecting to an external schema registry.

You can upgrade to the latest version of the WarpStream BYOC Agents to benefit from our new-and-improved backpressuring system. Check out our official changelog to see the full list of Agent updates.

Try WarpStream With $400 in Free Credits

WarpStream is free to try. After you create your account, it will be loaded with $400 in free credits so you can test how easy it is to set up and use WarpStream.

Get Started For Free

Connect With Us

LinkedIn X (Twitter) Facebook YouTube Slack Discord

Dealing with rejection (in distributed systems)

WarpStream — Wed, 14 Aug 2024 17:45:51 +0000

by Richard Artoul

Distributed systems: theory vs. practice

There are two ways to learn about distributed systems:

Reading academic and industry papers.
Operating them in production.

The best distributed systems practitioners have done both extensively because they teach you different things.

Traditionally, when people want to learn about distributed systems, they start (as they should) with the literature where they’ll learn about:

Algorithms.
Data structures.
Replication strategies.
Consensus.
Trade-offs between consistency and availability in the face of partition failures.

This helps people build a great foundation, but there are some topics that simply aren’t well covered by the literature. These topics include things like:

Instrumenting the system to make it observable and debuggable.
Maintaining quality of service in the face of multi-tenancy.
Backpressure.
Designing the system to be operable by humans.

I’ve spent the last 10 years of my life building and operating distributed systems in production. I’ve been on-call for (almost) every major open-source database on the market.

In addition, I’ve built from scratch (along with my colleagues) and operated several distributed databases in production: M3DB, Husky, and most recently, WarpStream. During this time, I learned a lot of practical knowledge about what it takes to convert a design that works on paper into an implementation that works in production at massive scale.

For example, Husky’s design was heavily inspired by industry-leading systems like Snowflake, Procella, and the wealth of available academic knowledge about leveraging columnar storage and vectorized query processing to analyze huge volumes of data in a short period of time.

But there is so much more that went into making Husky a scalable, cost-efficient, and reliable system than just what can be found in the academic literature. For example, while there are hundreds of excellent academic and industry papers about how to make a highly efficient vectorized query engine, there are shockingly few papers (rooted in actual experience with sufficient detail to replicate) about how to make that query engine multi-tenant, scale to thousands of concurrent queries, with more than 10 orders of magnitude difference in the cost of individual queries, while still maintaining good quality of service and letting your engineers sleep through the night.

That is an incredibly difficult problem to solve, and many teams have solved it (including us when we were at Datadog), but almost no one has documented how they solved it.

Unfortunately for readers stuck deep in the mud of building vectorized query engines, I will not be discussing how we solved that problem at Datadog in this post. Instead, I want to focus on a related problem that we confronted when building WarpStream: backpressure.

Backpressure

Backpressure is one of the most important practical details that every good distributed system has to get right if it’s going to stand a chance at survival in production. Without a good backpressuring system, a small increase in load or an errant client can easily knock over the entire system and leave it stuck in a death spiral from which it will never recover without manual intervention — usually by shutting off all the clients.

At its core, backpressure is a really simple concept. When the system is nearing overload, it should start “saying no” by slowing down or rejecting requests. This applies pressure back toward the client (hence the term) and prevents the system from tipping over into catastrophic failure. This works because, in most real systems, the cost of rejecting a request is several orders of magnitude cheaper than actually processing it.

Backpressure should happen as early as possible in the request-processing lifecycle. The less work the system has to do before rejecting a request, the more resilient it will be.

Of course, the big question is: How do we know when we should reject a request? We could trivially create a system that is incredibly difficult to knock over by only allowing it to process one request at a time and rejecting all other requests, but that system wouldn’t be of much use to anyone. However, if we start backpressuring too late, then our window to prevent the system from self-destructing may have already passed.

Unfortunately, this is one of those scenarios where we need to strive for a difficult-to-quantify Goldilocks zone where the backpressuring system kicks in at just the right time. That’s the art of dealing with rejection (in distributed systems). Let’s clarify by defining what failure and success might look like.

Failure of the backpressuring system is easy to define: it looks like catastrophic failure. For example, if an increase in load or traffic can cause the system to run out of memory, that’s a pretty bad failure of the backpressuring system. In this case, backpressure is happening too late.

Similarly, if the number of requests processed by the system drops significantly below the peak throughput the system is capable of when it isn’t overloaded, that can also be a form of catastrophic failure.

Ideal behavior regarding latency is more use-case dependent. For some latency-critical workloads, an ideal system will maintain consistently low latency for requests it chooses to accept and immediately reject all requests that it can’t serve within a tight latency budget. These use cases are more of an exception than the norm, though, and in most scenarios, operators prefer higher latency (within reason) over requests being rejected.

As a concrete example, Amazon would almost certainly prefer an incident where the latency to add an item to a user’s cart increases from 100ms to 2 seconds over one where 95% of add-to-cart operations are rejected immediately, but those that are accepted complete in under 100ms.

Something like this might be considered acceptable, for example:

Of course, all of these examples are “within reason”. If you recruit a thousand servers to do nothing but DDOS a single server, no amount of intelligent programming will save you if the victim server’s network is completely saturated.

OK, enough abstract discussion; let’s dive into a concrete use case now.

Backpressure in the WarpStream Agents

WarpStream is a drop-in replacement for Apache Kafka® that is built directly on top of object storage and deployed as a single, stateless binary. Nodes in a WarpStream cluster are called “Agents”, and while the Agents perform many different tasks, primarily they’re responsible for:

Writes (Kafka Produce requests)
Reads (Kafka Fetch requests)
Background jobs (compactions)

Each of these functions needs a reasonable backpressure mechanism, and if a WarpStream Agent is handling multiple roles, those backpressure systems may need to interact with each other. For now, let’s just focus on writes / Produce requests.

Processing every Produce request requires some memory (at least as much data as is being written by the client), a dedicated Goroutine (which consumes memory and has scheduling overhead), and a variety of other resources. If the Agents blindly accept every Produce request they receive, with no consideration for how fast other Produce requests are being completed or how many are currently in-flight, then a sufficiently high number of Produce requests will eventually overwhelm the Agents.¹

Now, queuing theory tends to deal in concepts like arrival rates and average request processing time. That could lead us to consider using a rate limiter to implement backpressure. For example, we could do some benchmarking and conclude that, on average, Agents can only handle X throughput/s/core, and thus configure a global rate limit for the entire process to X * NUM_CORES.²
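As a rough illustration of that idea, the sketch below derives a process-wide limit from a per-core figure. The constant perCoreBytesPerSec and the helper globalLimit are made up for this example; the per-core number would have to come from benchmarking a real workload.

```go
package main

import (
	"fmt"
	"runtime"
)

// perCoreBytesPerSec stands in for "X" in the text. The value is purely
// illustrative and not a measured WarpStream number.
const perCoreBytesPerSec = 50 << 20 // 50 MiB/s per core

// globalLimit scales a per-core throughput figure by the number of cores.
func globalLimit(perCore, cores int) int {
	return perCore * cores
}

func main() {
	// GOMAXPROCS(0) reports how many cores the Go scheduler will use
	// without changing the setting.
	cores := runtime.GOMAXPROCS(0)
	fmt.Printf("global write limit: %d bytes/s across %d cores\n",
		globalLimit(perCoreBytesPerSec, cores), cores)
}
```

Because the limit is computed from the available cores, the same configuration carries over to a different instance type, but the per-core constant itself still has to be chosen by hand.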

That would work, but it would be pretty annoying to tune. What is a reasonable rate-limit for write throughput per core?

We could benchmark it, but the performance will vary heavily from one workload to another. Also, what if we improve the performance of the Agents in the future?
We could measure it empirically at runtime, and then back that out into a dynamically adjusted rate-limit, but that’s likely to be brittle and complex.

In general, rate limits make a lot of sense for limiting the amount of resources that individual tenants can consume in a multi-tenant environment (which is one aspect of backpressuring), but they’re not a great solution for making sure that systems don’t tip over into catastrophic failure.

Instead, I’ve always found the best results by tracking something that correlates with the system becoming overloaded and falling behind. Inevitably, if the system is overloaded, something will begin to pile up: memory usage, inflight requests, queue depth, etc.

These things are easy to track and they relieve us from the burden of thinking in terms of rates (mostly). The threshold for “this is too many things in-flight” is usually much easier to tune and reason about than a pure rate-limit and will automatically adapt to a much wider variety of workloads. As a bonus, if the system gets more efficient over time, the backpressure mechanism will automatically adjust because it will require more load to make things pile up, and so the backpressure system will kick in later. Anytime we can make something self-tuning like this, that’s a huge win.

For Produce requests in the WarpStream Agents, the best criteria we found for triggering backpressure were two metrics:

The number of in-flight bytes that had not yet been flushed to object storage.
The number of in-flight files that had not yet been flushed to object storage.

Intuitively this makes sense: if the Agents are overloaded, then pretty quickly requests will begin piling up in-memory, and the value of those two metrics will spike. It’s pretty easy to do the equivalent of the following in the WarpStream code:

func (h *Handler) HandleProduceRequest(
 ctx context.Context,
 req ProduceRequest,
) (ProduceResponse, error) {
 if h.numberOfOutstandingBytesOrFilesTooHigh() {
  return nil, ErrBackpressure
 }

 // Process request.
}

And we’re done, right?

Unfortunately, this is a woefully inadequate solution. The WarpStream Agents process every Kafka protocol request in a dedicated Goroutine, and there is a pretty big risk here that a flood of Produce requests will come in at the same time, all individually pass the h.numberOfOutstandingBytesOrFilesTooHigh() check at the same time, and then immediately throw the Agent way over the target limit.
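To make the race concrete: a check followed by a separate increment lets many goroutines pass the check at once, whereas a compare-and-swap loop reserves capacity in a single step. The sketch below assumes one hypothetical byte budget; the names inflightLimiter, tryReserve, and release are illustrative and not taken from the WarpStream codebase.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrBackpressure mirrors the error name used in the snippet above.
var ErrBackpressure = errors.New("backpressure: too many in-flight bytes")

// inflightLimiter tracks bytes that were accepted but not yet flushed.
type inflightLimiter struct {
	maxBytes int64
	inflight atomic.Int64
}

// tryReserve adds n bytes to the in-flight total, or returns
// ErrBackpressure without changing the total if the budget would be
// exceeded. The check and the increment happen as one atomic step.
func (l *inflightLimiter) tryReserve(n int64) error {
	for {
		cur := l.inflight.Load()
		if cur+n > l.maxBytes {
			return ErrBackpressure
		}
		if l.inflight.CompareAndSwap(cur, cur+n) {
			return nil
		}
		// Another goroutine changed the total first; re-check and retry.
	}
}

// release gives n bytes back once they have been flushed.
func (l *inflightLimiter) release(n int64) {
	l.inflight.Add(-n)
}

func main() {
	l := &inflightLimiter{maxBytes: 100}
	fmt.Println(l.tryReserve(60)) // fits within the budget
	fmt.Println(l.tryReserve(60)) // would exceed it, so it is rejected
	l.release(60)
	fmt.Println(l.tryReserve(60)) // fits again after the release
}
```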

We could fix that by making that method atomically check the metrics and increment them, but we actually have a bigger problem: by the time HandleProduceRequest() is called, we’ve already done a lot of work:

Copied the data to be written off the network.
Copied it into temporary buffers.
Spawned a new goroutine (or reused an existing one).
Emitted a bunch of metrics.

At this point, it’s almost worth just accepting the request because the incremental cost of actually processing the request at this point is not that much higher than the work we’ve already performed.

It would be a lot better if we could have rejected this request earlier. Like, way earlier, before we allowed it to consume almost any memory in the first place. Thankfully, this is possible! We wrote the TCP server that powers the WarpStream Kafka protocol from scratch, so we have full control over it. At a very high level, the server code looks something like this:


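for {
 // Accept each client connection and serve it on its own goroutine.
 conn, err := listener.Accept()
 if err != nil {
  continue // Real code would log or back off here.
 }
 go handleConnection(conn)
}

func handleConnection(conn net.Conn) {
 for {
  // Read one request off the wire, then process it concurrently.
  header := ReadHeader(conn)
  message := ReadMessage(conn)
  go func() {
   response, err := handler.HandleRequest(message)
   sendOutcome(response, err)
  }()
 }
}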

I’m glossing over a lot of details, but you get the gist. In a really busy WarpStream cluster, a single Agent might have thousands or tens of thousands of active connections, and each of those connections will have Goroutines that are reading bytes off the network, allocating messages, and spawning new Goroutines as fast as they can.

In that scenario, it doesn’t matter that the HandleRequest() method will start returning backpressure errors. There’s too much concurrency and the Goroutines handling the Kafka client connections will eventually overwhelm the VM’s resources and trigger an out of memory error.

Ideally, once the Agent detected that it was overloaded, all these connection handler Goroutines would stop processing messages for a while to allow the system to recover. This is the difference between load-shedding and backpressuring. The handler in the above code is shedding load (by rejecting requests), but it’s not applying pressure backward to the rest of the system.

So how do we fix this? Well, the first thing we can do is make a tiny modification to the handleConnection function:


for {
 conn, err := listener.Accept()
 if err != nil {
  continue // Real code would log or back off here.
 }
 go handleConnection(conn)
}

func handleConnection(conn net.Conn) {
 for {
  // Pause before reading anything while the Agent is overloaded.
  for {
   throttleDuration := handler.ShouldThrottle()
   if throttleDuration > 0 {
    time.Sleep(throttleDuration)
    continue
   }
   break
  }

  header := ReadHeader(conn)
  message := ReadMessage(conn)
  go func() {
   response, err := handler.HandleRequest(message)
   sendOutcome(response, err)
  }()
 }
}

Again, I’m oversimplifying, but this is already much better than the previous solution. Now, it will be much harder for misbehaving clients to knock an Agent over because if the Agent is overloaded, it will stop reading bytes from the network entirely. It’s pretty hard to do less work than that.

Even better, TCP incorporates the concept of backpressure deeply into its design, so this simple trick will apply backpressure back into the networking stack and eventually all the way back to the client VMs.

Finally, we can take this one step further and make the Agents refuse to even accept new connections when they’re overloaded:

for {
 // Stop accepting new connections if overloaded.
 sleepUntilHealthy()
 conn, err := listener.Accept()
 if err != nil {
  continue // Real code would log or back off here.
 }
 go handleConnection(conn)
}

func handleConnection(conn net.Conn) {
 for {
  // Stop accepting new requests on existing connections
  // if overloaded.
  sleepUntilHealthy()
  header := ReadHeader(conn)
  message := ReadMessage(conn)
  go func() {
   response, err := handler.HandleRequest(message)
   sendOutcome(response, err)
  }()
 }
}

func sleepUntilHealthy() {
 for {
  throttleDuration := handler.ShouldThrottle()
  if throttleDuration > 0 {
   time.Sleep(throttleDuration)
   continue
  }
  break
 }
}

This is a bit heavy-handed, but that’s OK. It will only kick in during very dire circumstances where the only alternative would be catastrophic failures and/or running out of memory.³

Sounds good on paper. But does it work? Let’s find out!

At WarpStream we don’t like to coddle our software. Production is a messy place where terrible things happen daily, so we try to simulate that as much as possible in all of our test environments.

One of our most aggressive environments is a test cluster that runs 24/7 with three WarpStream Agents. The benchmark workload is configured such that all three Agents are pegged at 80–100% CPU utilization all the time. The benchmark itself consists of eight different test workloads, using four different Kafka clients, and varying batch sizes, partition counts, throughput, number of client instances, etc.

In total, there are hundreds of producer and consumer instances, thousands of partitions, four different client compression algorithms, a mix of regular and compacted topics, and almost all the producers are configured to use the most difficult partitioning strategy where they round-robin records amongst all the partitions.

In addition, the benchmark workloads periodically delete topics and recreate them, rewind and begin reading all of the data from the beginning of retention for the compacted topics, manually trigger consumer group rebalances, and much more. It’s just absolute chaos. Great for testing!

When we iterate on WarpStream’s backpressure system, we use a very simple test: we aggressively scale the cluster down from three Agents to one. This triples the load on the sole remaining Agent that was already running at almost 100% CPU utilization.

Before our most recent improvements, this is what would happen:

Not fun.

But with the new build and all of our latest tricks?

Not perfect, but it is pretty decent considering the Agent is running at 100% CPU utilization:

Importantly, the struggling Agent immediately recovers as soon as we provide additional capacity by adding a node. That is exactly the behavior you want out of a distributed system like this: the system should feel “springy” such that it immediately “bounces back” as soon as additional resources are provided or load is removed.

Another counter-intuitive outcome here is that the Agent continues to function reasonably even while pegged at 100% CPU utilization for a sustained period of time. This is very difficult to accomplish in practice, but it represents the best case scenario for backpressuring: the Agent is able to utilize 100% of the available resources on the machine without ever becoming unstable or unresponsive.

Any operator (or auto-scaler) can look at that graph and immediately determine the right course of action: scale up! Contrast that with a system that starts backpressuring while the underlying resources are under-utilized (say at 40% CPU utilization). That’s going to be a lot more difficult to understand, debug, and most importantly, react to in an automated manner.

Of course, that’s just how we manage backpressure for Produce requests. The Fetch code path is even more nuanced and required some novel tricks that we’d never employed in any system we’d worked on before. But this post is already way too long, so that’ll have to wait until next time!

If you like sleeping through the night and letting your infrastructure auto-scale and protect itself automatically, check out WarpStream.

¹ This is true “by definition” of the most basic principles of queuing theory.

² It’s usually best to define limits as a function of the available resources. This way, the application automatically scales to different instance types without modifying the underlying configuration.

³ Note that all of this is independent of the quota / throttling system that is native to Apache Kafka. We’ll discuss that more in a different post.

Create a free WarpStream account and start streaming with $400 in free credits. Get Started!

Announcing WarpStream Schema Validation

WarpStream — Thu, 18 Jul 2024 16:54:07 +0000

by Brian Shih

Why do we need schemas?

Schemas in Apache Kafka® enable operators to ensure that their data conforms to the expected schema and prevent data quality and compliance issues, such as rogue producers writing data to Kafka topics that shouldn’t be there. This can be problematic in cases where downstream applications expect only to receive records that conform to a specific schema and a specific format, e.g., ETL applications that write to a database.

Schemas have become ubiquitous in Kafka and are a key component of any data governance, compliance, and platform management regime.

How Schemas Work in Kafka

Historically, schemas in Kafka have been implemented as a client-side feature to reduce the load on the stateful Kafka Brokers. Kafka uses an external server (a Schema Registry) to store schemas, and the producer client periodically polls the registry and caches the schemas and their IDs. Before writing to Kafka, the producer client serializes the data and validates that the record is compatible with the schema retrieved from the Schema Registry. If the record is incompatible with the schema, the serializer throws an error, and the producer does not produce the record to Kafka, which protects against incorrect data being written to Kafka from our producer. If the record is compatible, the producer writes the data with the schema ID prepended to the record. On the other side, consumers look up the schema from the Schema Registry, and if the schema referenced by the record is compatible with the schema in the Schema Registry, the consumer deserializes the record. If not, the consumer throws an error.
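For a concrete picture of how a schema ID travels with a record, the sketch below assumes the framing commonly used by Confluent-compatible serializers: a zero "magic" byte, then a 4-byte big-endian schema ID, then the serialized payload. The function name splitFramedRecord is made up for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// magicByte is the leading byte of the assumed framing.
const magicByte = 0x0

// splitFramedRecord separates the schema ID from the payload of a record
// value framed as: magic byte, 4-byte big-endian schema ID, payload.
func splitFramedRecord(value []byte) (schemaID uint32, payload []byte, err error) {
	if len(value) < 5 {
		return 0, nil, errors.New("record too short to carry a schema ID")
	}
	if value[0] != magicByte {
		return 0, nil, fmt.Errorf("unexpected magic byte 0x%x", value[0])
	}
	return binary.BigEndian.Uint32(value[1:5]), value[5:], nil
}

func main() {
	// A record that claims schema ID 42 and carries a small JSON payload.
	framed := append([]byte{0, 0, 0, 0, 42}, []byte(`{"user":"a"}`)...)
	id, payload, err := splitFramedRecord(framed)
	fmt.Println(id, string(payload), err)
}
```

Note that reading the ID only tells you which schema the record claims to follow; it says nothing about whether the payload actually conforms to it.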

While this implementation satisfies the basic requirement to add schemas to records in Kafka, it lacks broker-side validation, meaning schemas are entirely a client-side feature. That’s problematic because it relies on clients to always do the right thing. The Kafka broker will happily accept whatever it’s given by any Kafka client, so while the client can validate that its own writes and reads conform to the proper schema, there is nothing that prevents another client from writing data that does not conform to the schema defined by a well-behaved producer. Broker-side validation is necessary to implement centralized data governance policies.

Various data governance products have been launched that enable the Kafka broker to do some schema validation; however, these features are limited in their utility because they can only validate that the schema ID matches the schema ID from the Schema Registry, not that the schema of the record actually matches the expected schema.

Announcing WarpStream Schema Validation

WarpStream is excited to announce that users can now connect WarpStream Agents to any Kafka-compatible Schema Registry and validate that records conform to the provided schema!

The WarpStream Agent connects to an external Schema Registry

WarpStream Schema Validation validates not only that the schema ID encoded in a given record matches the schema ID in the Schema Registry but also that the record actually conforms with the provided schema. In addition, WarpStream Schema Validation adds a “warning-only” configuration property, which, when enabled, emits a metric to identify rejected records instead of rejecting them, providing easier testing and monitoring during schema migrations without implementing separate dead-letter queues or risking data loss. WarpStream Schema Validation is built into the WarpStream Agent, so the Agent does this validation in the customer’s environment, and no data is exfiltrated.
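The difference between rejecting and warning can be sketched with a toy validator. Everything here is hypothetical: toySchema is a stand-in for a real JSON Schema or Avro definition, and the invalid counter stands in for an emitted metric. It is not how the Agent implements validation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toySchema maps required field names to the expected JSON kind.
type toySchema map[string]string

type validator struct {
	schema   toySchema
	warnOnly bool
	invalid  int // stand-in for a "records failed validation" metric
}

// kindOf names the JSON kind of a decoded value.
func kindOf(v any) string {
	switch v.(type) {
	case string:
		return "string"
	case float64:
		return "number"
	case bool:
		return "boolean"
	default:
		return "other"
	}
}

// conforms reports whether record is a JSON object that contains every
// required field with the expected kind.
func (v *validator) conforms(record []byte) bool {
	var obj map[string]any
	if err := json.Unmarshal(record, &obj); err != nil {
		return false
	}
	for field, kind := range v.schema {
		val, ok := obj[field]
		if !ok || kindOf(val) != kind {
			return false
		}
	}
	return true
}

// accept decides whether the record should be written. Non-conforming
// records are always counted; in warn-only mode they are still accepted.
func (v *validator) accept(record []byte) bool {
	if v.conforms(record) {
		return true
	}
	v.invalid++
	return v.warnOnly
}

func main() {
	schema := toySchema{"id": "number", "email": "string"}
	strict := &validator{schema: schema}
	lenient := &validator{schema: schema, warnOnly: true}

	bad := []byte(`{"id": "not-a-number", "email": "a@example.com"}`)
	fmt.Println(strict.accept(bad), strict.invalid)   // rejected and counted
	fmt.Println(lenient.accept(bad), lenient.invalid) // accepted but counted
}
```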

To connect the WarpStream Agents with a Schema Registry, specify the optional -schemaRegistryURL flag in the Agent configuration. WarpStream supports Basic, TLS, and mTLS authentication between the Agent and the Schema Registry.

Once the Agents are connected to a compatible Schema Registry, WarpStream Schema Validation can be enabled with the following topic-level configurations:

Enabling record-level validation with an external schema registry increases the CPU load for the Agents. But unlike Kafka brokers, which cannot be auto-scaled without significant operational toil and risk of data loss, WarpStream Agents are completely stateless and can be scaled elastically based on basic parameters like CPU utilization. This means that, unlike Kafka, a WarpStream cluster can be scaled automatically on the fly, so there’s no need to permanently provision more Agents in anticipation of increased CPU utilization.

In addition, using Agent Roles, WarpStream can isolate parts of a workload to a specific set of Agents, which reduces the impact of increased load caused by Schema Validation. Schema Validation uses the proxy-produce role, so Agents handling Produce() requests can be isolated from the rest of the cluster and scaled independently.

Currently, WarpStream supports validating JSON and Avro schemas, with support for Protobuf coming soon. While the current implementation of WarpStream Schema Validation utilizes external Schema Registries, we are also currently working on building our own WarpStream-native schema registry.

To learn more about WarpStream Schema Validation, read the docs, or contact us.

The Kafka Metric You’re Not Using: Stop Counting Messages, Start Measuring Time

WarpStream — Tue, 16 Jul 2024 18:03:30 +0000

by Aratz Manterola Lasa

Consumer groups are the backbone of data consumption in Kafka. Consumer groups are logical groupings of consumers who work together to read data from topics, dividing the workload by assigning partitions to individual group members. Each group member then reads messages from its assigned partitions independently. Consumer groups also keep track of consumption progress by storing offset positions for every topic partition that the group is consuming. This ensures that when a member leaves the group (because it was terminated or crashed), a new member can pick up where the last one left off without interruption.

Depiction of a Kafka consumer group. Consumers read from their respective partitions, and commit their progress (as Kafka offsets) back to the cluster.

Consumer groups are great, but monitoring them can be a challenge. Specifically, it can be tricky to determine if your consumers are keeping up with the incoming data stream (i.e., are they “lagging”) and, if not, why. In this post, we’ll explain why the usual way of measuring consumer group lag (using Kafka offsets) isn’t always the best and show you an alternative approach that makes it much easier to monitor and troubleshoot them.

The most common way to monitor consumer groups is to alert on the delta between the maximum offset of a topic partition (i.e., the offset of the most recently produced message) and the maximum offset committed by the consumer group for that same topic partition. We’ll call this metric “offset lag.”

Offset lag is the delta between the committed offset and the offset of the last produced record for each topic-partition.

Consumer groups track their own progress using Kafka offsets, so intuitively, it makes sense to reuse the same mechanism to monitor whether they’re keeping up. High offset lag indicates that your consumers can’t keep up with the incoming data, necessitating action like increasing the number of consumers, partitions, or both. In addition, the rate of change of consumer group lag is an important early indicator of potential problems and a good indicator that attempts to mitigate observed increases in lag are working.
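Using the definition from the text, offset lag for one topic-partition is a single subtraction, and its rate of change is the difference between two samples. The function names below are illustrative only.

```go
package main

import "fmt"

// offsetLag is the gap between the most recently produced offset and the
// offset committed by the consumer group, for a single topic-partition.
func offsetLag(latestOffset, committedOffset int64) int64 {
	lag := latestOffset - committedOffset
	if lag < 0 {
		return 0 // guard against samples taken at slightly different times
	}
	return lag
}

// lagGrowth is the change in lag between two samples. A positive value
// means the group is falling further behind.
func lagGrowth(previous, current int64) int64 {
	return current - previous
}

func main() {
	previous := offsetLag(1_000, 900)  // 100 offsets behind
	current := offsetLag(1_500, 1_100) // 400 offsets behind
	fmt.Println(previous, current, lagGrowth(previous, current))
}
```

The numbers carry no unit of time, which is exactly why they are hard to compare across workloads with different throughput.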

The Problem with Consumer Offset Lag

Tracking consumer group offset lag can be a really useful way to monitor an individual Kafka consumer. However, converting offset lag into a value that is meaningful to humans or that can be compared with other workloads is difficult.

Let’s use a concrete example to make this more clear. Imagine you’re an infrastructure engineer responsible for managing your company’s data streaming platform. In a recent incident, one team’s consumer application fell so far behind that customer data was delayed for hours. No monitors were fired, and you only discovered the issue when some of your (rightfully angry!) customers complained.

As a remediation item, you’ve been tasked with ensuring that all Kafka consumers are monitored, so alarms will go off if any consumers fall “too far” behind.

Great! We just learned about the concept of offset lag, so you can create a monitor on the offset lag metric and group by consumer group name, right? All you have to do is pick the offset lag “threshold” beyond which the monitor should fire.

You run the query in a dashboard to see the current values, and you are shocked to find that the current offset lag for your various consumer groups varies wildly, from 10 (no extra zeros!) to 12 million. You freeze in panic. “Are we having an incident right now!?”

Two different consumer groups with wildly varying offset lag.

After some investigation and talking to other teams, you realize this is normal. Some of these consumer groups naturally have much higher throughput than others, so their baseline offset lag is higher because there’s more data “in-flight” at any given moment. Other consumer groups process data in large batches, accumulating large amounts of offset lag, consuming it all at once, and then repeating that process.

Every team’s use case makes sense in isolation, but now you’re stuck. How in the world will you pick one threshold that makes sense for all of these different workloads? You could pick different thresholds for each workload, but even then, you’ll probably get woken up in the middle of the night with false alarms when some of these workloads grow in throughput and their baseline offset lag increases, even if the actual consumers are keeping up just fine.

Time Lag: A More Intuitive Metric

To overcome the limitations of offset-based lag, the Kafka community has introduced a more intuitive metric called “time lag”. While intuitive, this concept wasn’t immediately available in Open Source Kafka’s native tooling. Companies like Agoda and AppsFlyer recognized its value and developed their own solutions, with Agoda notably sharing their insights in a blog post that inspired many in the community (including us!). Since then, tools like Burrow have emerged, offering time lag calculation as part of their Kafka monitoring tools.

Imagine once again that you’re an infrastructure engineer, and you’re in the middle of an incident where one of your consumer groups has fallen behind. Your customers are asking you how delayed their data will be. They’re likely to look at you with a blank stare if you tell them: “your data is delayed 30 million offsets”, but they’ll understand immediately if you tell them the maximum data delay is 17 minutes.

Time lag is calculated using the following function:

Time Lag = CurrentTime - LastTimeConsumedOffsetWasLatest

Where LastTimeConsumedOffsetWasLatest is defined as the moment when the last consumed message was also the most recently produced message.

Let’s illustrate that with an example. Imagine a Kafka topic where:

The latest produced message has offset 15 and was generated at 3:15 PM.
A consumer group processes messages up to offset 10 by 3:20 PM.
The message with offset 11 was produced at 3:10 PM.

In this scenario, LastTimeConsumedOffsetWasLatest is 3:10 PM. This is because at 3:09:59 PM, offset 10 was still the latest message on the topic. However, at 3:10 PM, offset 11 was produced, meaning the consumer started to fall behind at that exact moment. So we round this up to 3:10 PM.

Therefore, at 3:20 PM, the time lag is calculated as:

Time Lag = 3:20 PM - 3:10 PM = 10 minutes

This means the consumer group is 10 minutes behind the most recent message.

Another way to put it: “time lag” is the time elapsed since the next message to be consumed was produced. This definition is simple but also deceptively elegant: because it is relative to the current time, the metric keeps increasing as long as there are unprocessed records, even if producers and consumers have stopped entirely. It acts as an alarm, alerting you to unprocessed messages even when the system appears idle.
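The 3:10 PM example can be reproduced in a few lines of Python. This is only a sketch of the definition: the `time_lag` function, the `produced_at` mapping, and the calendar date are illustrative assumptions, not WarpStream code.

```python
from datetime import datetime, timedelta


def time_lag(produced_at: dict, last_consumed: int, now: datetime) -> timedelta:
    """Time elapsed since the next message to be consumed was produced."""
    next_offset = last_consumed + 1
    if next_offset not in produced_at:
        # Nothing was produced after the last consumed offset: caught up.
        return timedelta(0)
    return now - produced_at[next_offset]


# Hypothetical date; only the times of day come from the example in the text.
day = datetime(2024, 8, 1)
produced_at = {
    11: day.replace(hour=15, minute=10),  # offset 11 produced at 3:10 PM
    15: day.replace(hour=15, minute=15),  # offset 15 produced at 3:15 PM
}
now = day.replace(hour=15, minute=20)

lag = time_lag(produced_at, last_consumed=10, now=now)
print(lag)  # 0:10:00
```

Because the result depends on `now`, calling the function again later with the same committed offset returns a larger value, which is exactly the "keeps increasing while idle" property described above.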

An Integrated Approach to Time Lag Calculation

While monitoring time lag can be a game-changer, accessing this metric isn’t always straightforward. If you search online resources, you’ll find the primary method involves third-party tooling that calculates this metric for you, like Burrow. These tools are great; they really make monitoring trivial. However, Burrow is yet another piece of software that has to be deployed, maintained, and troubleshot.

At WarpStream, we like to make things easy. Asking our users to install third-party tooling just to know if their consumer applications were caught up didn’t sit right with us. So, we decided to build time lag measurement directly into WarpStream so that all our users would benefit from it out of the box.

This is probably a good time to briefly review WarpStream’s architecture. If you’re not familiar with it, WarpStream is a drop-in replacement for Apache Kafka built directly on top of object storage. WarpStream has many architectural differences from Apache Kafka, but the one most relevant to the current topic is that in addition to separating compute from storage, WarpStream also separates data from metadata.

WarpStream architecture diagram. Agents (stateless thick proxies) run in the customer’s cloud account, and the metadata store runs in WarpStream’s cloud account.

Customers’ raw data is stored exclusively in their own S3 buckets, accessible only to them. Meanwhile, WarpStream Cloud stores metadata in a highly available, quadruply-replicated metadata store.

The fact that WarpStream stores all of the cluster’s metadata in a centralized metadata store makes calculating time lag (relatively) straightforward. Unlike Apache Kafka, we don’t have to read or load any raw records or their headers; we can just query the timestamp index in the metadata store directly. This has the added benefit that it doesn’t rely on potentially unreliable record header timestamps (the client can set custom timestamps in the records). Instead, WarpStream maintains its own accurate timestamps in the metadata store and uses optimized data structures for time-based searches.

There was one challenge we had to solve, though: metrics are published by the Agents (data plane), which run in the customer’s environment and expose metrics via a Prometheus endpoint. However, the time lag calculation was running in WarpStream’s cloud control plane, so we needed a mechanism to make the time lag metrics the control plane generated available as Prometheus metrics in the Agents.

To solve this, we came up with a very simple solution: leverage WarpStream’s existing job queueing system. WarpStream’s architecture includes a centralized scheduler on the control plane that orchestrates various operational tasks. Agents, deployed within the customer’s environment, regularly poll this scheduler to receive and execute tasks, including functions like data compaction and object storage cleanup. Leveraging this existing infrastructure, we introduced a new job dedicated to calculating time lag metrics. This job runs on the control plane, periodically computing the metrics and making them available to the Agents, which retrieve them during their polling cycles and then emit them. We liked this solution because it’s simple and allows us to provide more metrics easily in the future.
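To make the data flow concrete, here is a toy, in-memory sketch of that polling pattern. The `ControlPlane` and `Agent` classes, the shape of the poll response, and the `consumer_group` label on this metric are all assumptions made for illustration; the real scheduler protocol is not described in this post.

```python
from dataclasses import dataclass, field

METRIC = (
    "warpstream_consumer_group_estimated_lag_very_coarse"
    "_do_not_use_to_measure_e2e_seconds"
)


@dataclass
class ControlPlane:
    """Stand-in for the scheduler; holds the latest computed lag values."""
    latest_lag_seconds: dict = field(default_factory=dict)

    def run_time_lag_job(self, computed: dict) -> None:
        # In reality this would be computed from the timestamp index.
        self.latest_lag_seconds = dict(computed)

    def poll(self) -> dict:
        # Hypothetical response: regular jobs plus piggybacked metrics.
        return {"jobs": [], "consumer_group_lag_seconds": dict(self.latest_lag_seconds)}


@dataclass
class Agent:
    """Stand-in for an Agent that re-exposes control plane metrics."""
    gauges: dict = field(default_factory=dict)

    def poll_once(self, control_plane: ControlPlane) -> None:
        response = control_plane.poll()
        self.gauges.update(response["consumer_group_lag_seconds"])

    def render_metrics(self) -> str:
        # Prometheus text exposition format, one gauge sample per group.
        return "\n".join(
            f'{METRIC}{{consumer_group="{group}"}} {value}'
            for group, value in sorted(self.gauges.items())
        )


control_plane = ControlPlane()
control_plane.run_time_lag_job({"orders": 4.5, "billing": 2.0})
agent = Agent()
agent.poll_once(control_plane)
print(agent.render_metrics())
```

The point of the pattern is that the Agent never computes lag itself; it only relays whatever the control plane handed back on the last poll.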

We leverage this metadata to provide the warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds metric. Why such a weird name? As you’ll see later in our more detailed explanation, this value is coarse-grained and imprecise. For example, while the actual end-to-end latency for a workload may be 500ms, this metric could report that the consumer group time lag is as high as 5 seconds. We wanted to clarify that while this metric is valuable for monitoring and alerting, it should not be used for benchmarking.

This metric is an approximation, so it’s not perfect, but it’s great for getting a general idea of how things are going and catching bigger problems. If an incident happens and someone tries to use this metric to explain a one-millisecond delay, they’re using the wrong tool for the job. We want people to feel comfortable setting alerts for more substantial delays (e.g., several minutes) because this metric excels at that. Think of it as a coarse-grained tool for catching big problems, not a fine-tuned instrument for performance tuning.

Offset lag for the blue consumer is more than 20x higher, but time lag is less than 2x higher.

The graph above showcases the difference between offset lag and time lag for two consumer groups. One group has a much larger offset lag of 20,000, while the other has a smaller lag of a few hundred. However, when we switch to the time lag, we see a different picture: both groups have very similar lags of 2 and 4.5 seconds. This shows how offset lag alone can be misleading and how time lag provides a more understandable overview of consumer group health.

Imagine trying to set alerts based on these metrics. With time lag, a single alert threshold (e.g., 2 minutes) could easily cover both consumer groups. With offset lag, you’d need to set different thresholds for each, carefully considering the nature of each workload and potentially missing alerts for the group with the “smaller” lag.

Behind the Scenes: The Mechanics of Time Lag Metrics

Having established the benefits of time lag over offset lag, let’s delve into the technical implementation. Understanding this implementation will also show how WarpStream calculates the time lag we introduced earlier: Time Lag = CurrentTime - LastTimeConsumedOffsetWasLatest.

WarpStream continuously tracks when messages are produced, associating each message offset with its corresponding timestamp. This data is stored internally in a way that allows us to efficiently query for offsets based on timestamps. To optimize storage, we aggregate this data into minute-level intervals. For each minute, we record the earliest offset produced (baseOffset) and the total number of offsets produced (offsetCount), effectively creating a compact time-series representation of message production.

When we need to know LastTimeConsumedOffsetWasLatest for a specific offset consumed by a consumer group, we use this index:

We first locate the relevant minute-level interval that contains that offset.
Within that interval, we divide the time by the offsetCount to estimate how frequently messages were produced within that time range.
Using the production rate and the offset’s position within the interval, we calculate an estimated timestamp for when that specific message was produced. This gives us the LastTimeConsumedOffsetWasLatest, which we then subtract from the current time to obtain the time lag.
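The three steps above can be sketched as a linear interpolation over a sorted list of minute-level entries. The `MinuteEntry` type, the function names, and the use of epoch seconds are assumptions for illustration, and the sketch assumes every offset falls inside some entry.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class MinuteEntry:
    minute_start: float  # epoch seconds at the start of the minute
    base_offset: int     # earliest offset produced in that minute
    offset_count: int    # number of offsets produced in that minute


def estimate_produce_time(index: list, offset: int) -> float:
    """Estimate when `offset` was produced, assuming a uniform rate per minute."""
    # Step 1: locate the entry whose offset range contains `offset`.
    bases = [entry.base_offset for entry in index]
    i = bisect_right(bases, offset) - 1
    if i < 0:
        raise ValueError("offset is older than the index")
    entry = index[i]
    if offset >= entry.base_offset + entry.offset_count:
        raise ValueError("offset is not covered by the index")
    # Step 2: estimate the spacing between messages in that minute.
    seconds_per_offset = 60.0 / entry.offset_count
    # Step 3: interpolate from the offset's position within the entry.
    return entry.minute_start + (offset - entry.base_offset) * seconds_per_offset


def estimate_time_lag(index: list, next_offset: int, now: float) -> float:
    """Time lag in seconds for a group whose next offset to consume is `next_offset`."""
    return now - estimate_produce_time(index, next_offset)


index = [
    MinuteEntry(minute_start=0.0, base_offset=100, offset_count=60),
    MinuteEntry(minute_start=60.0, base_offset=160, offset_count=120),
]
print(estimate_produce_time(index, 190))           # 75.0
print(estimate_time_lag(index, 190, now=100.0))    # 25.0
```

Offset 190 is the 30th offset of a minute that saw 120 messages, so it is placed 15 seconds into that minute. If the real messages arrived in a burst at the end of the minute, the estimate would be off by up to most of a minute, which is one reason the metric is labeled coarse.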

As mentioned earlier, a dedicated background job within WarpStream’s control plane periodically calculates each cluster’s time lag and other relevant metrics for every consumer group. This involves querying the committed offsets for every consumer group and partition and then utilizing the timestamp index to compute the corresponding time lag values. These calculated values are subsequently transmitted to the WarpStream Agents operating within the customer’s environment. And finally the Agents expose these time lag metrics via their Prometheus endpoint, under the name warpstream_consumer_group_estimated_lag_very_coarse_do_not_use_to_measure_e2e_seconds.

However, keeping a record of every minute would consume excessive storage space for clusters with many topic-partitions and high (or infinite) retention. To address this, we merge the index entries periodically. This involves merging multiple entries into one, updating the baseOffset and offsetCount, and introducing an additional field called minuteCount to keep track of the number of minutes the compacted entry represents.

This merging does sacrifice some timestamp precision, but we prioritize the most recent entries, ensuring we maintain their original accuracy untouched. Older entries are the only ones subject to merging. We prioritize recent entries because the more recent an offset is, the more crucial it is for consumers to have precise lag information. If a consumer is 10 minutes behind, a 30-second difference isn’t a major concern. But for a consumer who’s only 1 minute behind, that level of precision becomes much more important. In this way, we balance optimizing storage efficiency and maintaining the level of precision that matters most for effective monitoring.
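A rough sketch of this kind of compaction is shown below. The `keep_recent` and `group_size` parameters and the `compact` function are hypothetical; the post does not say how many entries are kept at full resolution or how many are merged at a time. The sketch also assumes the entries are sorted and contiguous in time.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    start: float           # epoch seconds at the start of the covered range
    base_offset: int       # earliest offset in the covered range
    offset_count: int      # number of offsets in the covered range
    minute_count: int = 1  # number of minutes this entry represents


def merge(entries: list) -> IndexEntry:
    """Collapse several adjacent entries into one coarser entry."""
    first = entries[0]
    return IndexEntry(
        start=first.start,
        base_offset=first.base_offset,
        offset_count=sum(e.offset_count for e in entries),
        minute_count=sum(e.minute_count for e in entries),
    )


def compact(index: list, keep_recent: int, group_size: int) -> list:
    """Merge older entries in groups; leave the newest `keep_recent` untouched."""
    split = max(len(index) - keep_recent, 0)
    old, recent = index[:split], index[split:]
    merged = [merge(old[i:i + group_size]) for i in range(0, len(old), group_size)]
    return merged + recent


index = [
    IndexEntry(0.0, 0, 10),
    IndexEntry(60.0, 10, 20),
    IndexEntry(120.0, 30, 30),
    IndexEntry(180.0, 60, 40),
    IndexEntry(240.0, 100, 50),
]
compacted = compact(index, keep_recent=2, group_size=3)
for entry in compacted:
    print(entry)
```

After compaction, interpolating inside the first entry spreads 60 offsets evenly over 3 minutes, so any per-minute variation in the production rate is lost for that range, while the two newest entries keep their one-minute resolution.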

Now, it’s clear why this metric is an approximation designed for monitoring and alerting, not precise benchmarking. The “very coarse” part of the metric’s name highlights a few key limitations:

Interpolation: The metric is calculated by interpolating within timestamp index entries that each cover at least one minute, which can introduce inaccuracies compared to the true message production time.
Committed Offsets: The metric relies on committed offsets, which may not always reflect the most up-to-date consumption progress. Consumers can commit offsets at varying intervals, either immediately after processing a message or after processing an entire batch. This leads to potential discrepancies between the committed offset and the actual latest consumed message.

These factors make the metric less suitable for precise performance measurements but perfectly adequate for identifying significant delays in consumer group processing. Moreover, the utility of the timestamp index extends beyond just calculating the time lag. It also enables internal Kafka APIs to query for offsets based on specific timestamps, which is useful for features like time-based data retrieval and analysis.

Depiction of timestamp index compaction and subsequent interpolation.

In conclusion, monitoring Kafka consumer groups doesn’t need to be a guessing game. By shifting the focus from message counts (offset lag) to time (time lag), understanding how consumers perform becomes trivial. With WarpStream’s built-in time lag metrics, this insight is readily available, so you can monitor your data pipeline and react promptly when consumers start to fall behind.

Create a free WarpStream account and start streaming with $400 in free credits. Get Started!

WarpStream Newsletter #4: Data Pipelines, Zero Disks, BYOC and More

WarpStream — Wed, 10 Jul 2024 17:20:40 +0000

Welcome to the fourth issue of the WarpStream newsletter. A lot has happened since our last newsletter: we’ve released five new blogs, made a bunch of product updates, and added new social channels (like Facebook and YouTube). Connect with us on social media and other platforms to stay updated via the links in the social footer at the bottom of this email.

Lots of New Blog Posts

Introducing WarpStream Managed Data Pipelines for BYOC clusters

For WarpStream BYOC clusters, Managed Data Pipelines provide a fully managed SaaS user experience for Bento, a lightweight stream processing framework that offers much of the functionality of Kafka Connect. They do so without sacrificing any of the cost benefits, data sovereignty, or deployment flexibility of the BYOC deployment model, and they come with version control.

Pixel Federation Powers Mobile Analytics Platform with WarpStream, saves 83% over MSK

Pixel Federation’s mobile games have millions of users, so you can imagine how many events and Kafka topics they have. By swapping MSK for WarpStream, they not only drastically reduced their costs, but were able to ditch complex VPC peering in favor of simpler agent groups.

Zero Disks is Better (for Kafka)

In a prior blog, we discussed how tiered storage won’t fix Kafka. The end goal is not some disks but zero disks. We cover how WarpStream’s Zero Disk Architecture (ZDA) allows you to do things like trivial or dead-simple auto-scaling of Kafka brokers (“agents” in WarpStream terminology), isolate workloads with agent groups, and easily run your entire data pipeline in your virtual private cloud (VPC) without the need for custom code or additional services.

Secure by default: How WarpStream’s BYOC deployment model secures the most sensitive workloads

WarpStream’s BYOC model is a hybrid approach that balances the two common cloud deployment models (fully self-managed and fully hosted SaaS). By splitting the software into discrete data and control planes, it ensures data privacy and sovereignty, compliance, cost optimization, and control.

Multiple Regions, Single Pane of Glass

A common problem when building infrastructure-as-a-service products is the need to provide highly available and isolated resources in many different regions while also having the overall product present as a “single pane of glass” to end-users. We review the options available to solve this and what we ultimately used (push-based replication).

Managed Data Pipelines

BYOC customers can now use Managed Data Pipelines. These combine the power of WarpStream’s control plane with Bento, an open-source stream processing framework.

This provides much of the same functionality as Kafka Connect and additional stream processing functionality like single message transforms, aggregations, multiplexing, enrichments, and native support for WebAssembly (WASM).

Pipelines run in your VPC and on your VMs, and data is processed in your buckets. WarpStream has zero access to this data. WarpStream provides a helpful UI for creating and editing pipelines, the ability to pause and resume pipelines dynamically, and version control.

Lots of New Metrics

We’ve added new metrics (and deprecated unnecessary ones) with nearly every release. We’ve recapped some of these new metrics below. You can check out our official changelog to get the full list.

warpstream_agent_kafka_fetch_uncompressed_bytes = Tracks the total uncompressed bytes fetched, replacing the warpstream_agent_kafka_fetch_bytes_sent metric.
warpstream_consumer_group_generation_id = Uses the consumer_group tag. This metric indicates the generation number of the consumer group, incrementing by one with each rebalance. It serves as an effective indicator for detecting occurrences of rebalances.

Coming Soon: Kafka Transactions

As we announced in our previous newsletter, the team is working on building in support for Kafka Transactions and expects to finish this work soon. If you want to use WarpStream for a workload requiring Transactions, please contact us! We would love to chat.

Try WarpStream With $400 in Free Credits

WarpStream is free to try. After you create your account, it will be loaded with $400 in free credits so you can test how easy it is to set up and use WarpStream.

Get Started For Free

Multiple Regions, Single Pane of Glass

WarpStream — Fri, 21 Jun 2024 17:01:44 +0000

by Emmanuel Pot

A common problem when building infrastructure-as-a-service products is the need to provide highly available and isolated resources in many different regions while also having the overall product present as a “single pane of glass” to end-users. Unfortunately, these two requirements stand in direct opposition to each other. Ideally, regional infrastructure is, well, regional, with zero inter-regional dependencies. On the other hand, users really don’t want to have to sign into multiple accounts/websites to manage infrastructure spread across many different regions.

When we first designed how we would expand WarpStream’s cloud control planes from a single region to many, we searched around for good content on the topic and didn’t find much. Many different infrastructure companies have solved this problem, but very few have blogged about it, so we decided to write about our approach and, perhaps more importantly, some of the approaches we didn’t take.

Let’s start by briefly reviewing WarpStream’s architecture by tracing the flow of a single request through the system. An operation usually begins with a customer’s Kafka client issuing a Kafka protocol message to the Agents, say a Metadata request. Since Kafka Metadata requests don’t interact with raw topic data like Produce and Fetch do, they can be handled solely by the WarpStream control plane. So when the WarpStream Agents receive a Kafka Metadata request, they just proxy it directly to the control plane.

WarpStream Agents deployed in a customer cloud account, sending metadata requests to WarpStream’s Metadata Store.

The request will hit a load balancer and then one of WarpStream’s “Gateway” nodes. The Gateway node’s job is to perform light authentication and authorization (basically, verify the request’s API key and map it to the correct customer / virtual cluster), and then forward the request to the Metadata Store for this customer’s cluster.

Based on this, it’s already clear that WarpStream’s control plane has to deal with two very different types of data:

Platform data: everything that users can control from our web console and APIs: users, clusters, API keys, SASL credentials, etc. This data is persisted in a primary Aurora database that runs in us-east-1 and changes very infrequently.
Cluster metadata: all the metadata that enables WarpStream to present the abstraction of Kafka on top of a low-level primitive like commodity object storage. For example, the Metadata Store keeps track of all the topic-partitions (and offsets) that are contained within every file stored in the user’s object storage bucket.

These two different types of data have very different requirements. The cluster metadata is in the critical path of every Kafka operation (both writes and reads), and therefore must be strongly consistent, extremely durable, highly available, and have low latency. As a result, we run every instance of the Metadata Store in a single region, whichever region is closest to the user’s WarpStream Agents. We also run each instance of the Metadata Store quadruply replicated across three availability zones, and we never replicate this metadata across multiple regions (for now).

The requirements for the platform data, on the other hand, look completely different. This data changes infrequently, and the data being slightly stale is of no consequence (eventual consistency is ok). While platform data like API keys are technically required in the critical path, since they’re trivially cacheable for arbitrarily long periods of time, they’re not really in the critical path. Also, unlike the cluster metadata, some of the platform data needs to be available in multiple regions for the service to function as a single pane of glass.

When we were evaluating how to add support for additional regions to WarpStream, there wasn’t much to think about for the virtual cluster Metadata Stores. We would just run dedicated instances of it in more regions, and users would connect their Agents to whichever region was closest, since most (but not all) WarpStream clusters run in a single region anyway.

The platform data (like API keys) is a different story. We could have used the same approach we did with the Metadata Store for the platform data by running a dedicated (and fully isolated) Aurora instance in every region, but that would have resulted in a poor user experience. Every region would have presented to users as a fully independent “website,” and users who wanted to run clusters in multiple regions would have had to maintain different WarpStream accounts, re-invite their teams, configure billing multiple times, etc., which is not what we wanted.

Hub and Spoke

When we looked at these requirements, the architecture that seemed like the best candidate was a “hub and spoke” model. The us-east-1 region that hosts our Aurora cluster would be the primary “hub” region that hosts the WarpStream UI and all of our “infrastructure as code” APIs for creating/destroying virtual clusters. All the other regions would be “spokes” that run fully independent and isolated versions of WarpStream’s Metadata Store, but not the Aurora database that stores the “platform data”.

Three spoke regions running fully isolated Metadata Stores powered by platform data replicated from the hub region.

CRUD operations to create and destroy virtual clusters would always be routed to the hub region, but actual customer WarpStream clusters and their Agents would only ever interact with a single “spoke” region and have no cross-regional dependencies.

This approach would give us the best of both worlds: a single pane of glass where WarpStream customers could manage clusters in any region while still keeping regions independent from each other such that a failure in one region (including the hub region) would never cause a failure in any other region. The one caveat with this approach is that any unavailability of Aurora in the primary hub region would prevent customers from creating new clusters in all regions, but existing clusters would continue working just fine. We felt like this was an acceptable trade-off.

However, this architecture did present a conundrum for us. In order for our product to present as a single pane of glass, some of the data in our primary region (like whether a virtual cluster exists, whether an API key was valid, etc) had to be made available in all of our spoke regions.

The hub region can read the platform data from the primary Aurora database, but where do the spoke regions read it from?

But we also needed to avoid creating any critical path inter-regional dependencies. Whatever we ended up doing, we had to ensure that the failure of a single region could never impact clusters running in different regions.

Easier said than done!

Option 1: Multi-Region Aurora

The first option we considered was to leverage AWS Aurora’s native multi-region functionality. Specifically, AWS Aurora has support for spawning read replicas in other regions. There are limits on how many additional regions can contain read replicas, and this approach would only work with AWS so we’d need a different solution for multi-cloud, but we thought this solution could be a good enough stop-gap in the short term to scale from a single region to a handful without much engineering work. We also really liked the idea of offloading the tricky problem of replicating a subset of our platform data to the AWS Aurora team.

Multi-region AWS Aurora cluster with the primaries in the hub region and read replicas in the spoke regions.

Unfortunately, when we investigated further, we discovered that any unavailability of the primary Aurora region could result in the unavailability of the secondary region read replicas. If that ever happened to us, we’d end up in a terrible situation: all of our spoke regions and their associated clusters / Metadata Stores would still be running (thanks to the in-memory caches), but restarting or deploying the control plane nodes would cause an incident due to the in-memory caches being dropped and unable to be refilled.

It turns out that multi-region functionality in Aurora is designed for a completely different use case: failing over regions fast when the primary region fails. Useful for that situation, but we wanted a solution that would never require manual intervention, so we ruled it out.

We briefly considered migrating to a different database with better multi-region availability support like CockroachDB or Spanner, but we had no previous experience with these technologies, and migrating all of our platform data to a brand-new database technology felt like overkill.

Option 2: Smart (and durable) Caches

Luckily, the platform data (like which virtual clusters exist and which API keys are valid) changes infrequently. So, another approach we considered was to query the source of truth (our us-east-1 Aurora database) from all spoke regions and then cache that data aggressively. For example, the first time a Gateway node encountered an API key that it had not seen before, it would query Aurora in us-east-1 to determine if it was valid and then cache the result in memory.

ap-southeast-1 spoke region querying the AWS Aurora cluster in us-east-1 to fill its in-memory caches.

This approach was appealing because it would require only minimal code changes, and it took advantage of a strategy we were already employing within the primary region to be resilient against Aurora failures: in-memory caching. The Gateway nodes were already using a custom “smart” cache (internally referred to as the “loading cache”) that would fit the bill perfectly. This cache employs a number of tricks to make it suitable for critical use cases like this:

It automatically deduplicates cache fills. This eliminates the thundering herd problem.
It incorporates negative caching as a first-class concept, so if the Gateway nodes keep receiving requests for API keys that no longer exist, they don’t keep querying Aurora over and over again.
It limits the concurrency of cache fills so that a flood of requests with new and unique API keys results in the cache being filled at a continuous (and manageable) rate instead of flooding Aurora with queries.
It implements asynchronous background refreshes (again, with limited concurrency) so that changes in Aurora (like invalidating an API key) are eventually reflected back into the state of the in-memory caches. This ensures that in normal circumstances, when Aurora is available, invalidating an API key is reflected within seconds, but in rare circumstances where Aurora is unavailable, the API gateway nodes can keep running more or less indefinitely as long as they aren’t restarted.
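For readers who have not built a cache like this, here is a simplified sketch of the first three tricks: deduplicated fills, negative caching, and bounded fill concurrency. It is not the internal loading cache itself. The class name, the `loader` callback, and the convention that the loader returns `None` for a missing key are all assumptions, and background refresh is left out to keep the example short.

```python
import threading
from concurrent.futures import Future


class LoadingCache:
    def __init__(self, loader, max_concurrent_fills: int = 4):
        self._loader = loader    # e.g. a function that queries the database
        self._lock = threading.Lock()
        self._values = {}        # key -> value; None means "known not to exist"
        self._inflight = {}      # key -> Future for a fill in progress
        self._fill_slots = threading.Semaphore(max_concurrent_fills)

    def get(self, key):
        with self._lock:
            if key in self._values:
                # Hit, including negative entries cached as None.
                return self._values[key]
            future = self._inflight.get(key)
            owner = future is None
            if owner:
                future = Future()
                self._inflight[key] = future
        if not owner:
            # Another caller is already filling this key: wait for its result.
            return future.result()
        try:
            with self._fill_slots:  # bound concurrent loader calls
                value = self._loader(key)
        except Exception as exc:
            with self._lock:
                del self._inflight[key]
            future.set_exception(exc)
            raise
        with self._lock:
            self._values[key] = value
            del self._inflight[key]
        future.set_result(value)
        return value


calls = []
known_keys = {"key-1": "cluster-a"}


def load_api_key(key):
    calls.append(key)
    return known_keys.get(key)  # None for unknown keys


cache = LoadingCache(load_api_key)
print(cache.get("key-1"), cache.get("key-1"))  # cluster-a cluster-a
print(cache.get("bogus"), cache.get("bogus"))  # None None
print(calls)                                   # ['key-1', 'bogus']
```

Each key reaches the loader once, even for the key that does not exist, and concurrent callers for the same key share a single fill. A production version would also need expiry and the asynchronous refresh described in the last item above.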

This smart caching strategy had served us well within our primary region, but ultimately we decided that it wasn’t an acceptable solution to our multi-region data replication problem. A failure of the primary Aurora database in us-east-1 wouldn’t immediately impair the other regions, but it would leave us unable to deploy or restart any of the control planes in our other regions until the availability of the Aurora database was restored. In other words, this approach suffered from the same problem as the Aurora read replicas approach.

Briefly, we considered extending our existing loading cache implementation to be durable so that we could restart control plane nodes, even when the primary Aurora database was down, without losing the data that had already been cached.

ap-southeast-1 spoke region querying the AWS Aurora cluster in us-east-1 to fill its in-memory caches, but then persisting the cached data to a local DynamoDB instance so that the Gateway nodes can still be restarted safely even if the primary Aurora cluster is unavailable.

However, we also decided against that approach because it didn’t feel very stable. The system would function completely differently when the primary Aurora database was available than when it was unavailable, and we didn’t like the idea of relying heavily on an infrequently exercised code path for such critical functionality.

Ultimately, we decided that while the loading cache was great for caching data within a region, it was not an acceptable solution for replicating data across regions.

Option 3: Push-Based Replication (we chose this one)

Both of the models we considered previously were “pull-based” models. Instead, we decided to pursue a “push-based” approach using a technique we’d learned at previous jobs called “contexts”. A Context is a bundle of metadata with the following characteristics:

Its values change slowly (if at all).
The metadata is associated with specific clusters or tenants.
The metadata needs to be made available on a large number of machines in a highly available manner.
Availability is always favored over consistency, i.e., we’d rather use values that are several hours old than have the system fail entirely.

For example, one of the contexts we created is called the “cluster context” and it contains:

The cluster’s ID
The cluster’s name
The ID of the tenant (customer) the cluster belongs to
A few additional internal fields required for the Metadata Store to begin processing requests

Building the contexts was straightforward. We wrote a job that scans the Aurora database every 10 minutes, builds the contexts, and then writes them as individual key-value pairs to a durable KV store in the relevant spoke regions.

Context publisher replicates contexts from the hub region to the spoke regions.

Of course, we pride ourselves on the fact that a new WarpStream cluster can be created in under a second, so forcing users to wait 10 minutes before their clusters were usable after creation wasn’t acceptable. Solving for this was easy though: when a new cluster is created (or any operation is performed that could result in a context being created or an existing one mutated), we submit an asynchronous request to the same job service that will trigger an update for that specific context immediately.

This gives us the best of both worlds. Changes to the contexts (like a new cluster being created or an API key being revoked) are reflected in their associated spoke regions almost instantaneously, but in the worst-case scenario where we forget to issue the async update request in some code path (or it fails for some reason), the issue will automatically resolve itself within a few minutes. In other words, this approach is fast in the happy path, and self-healing in the non-happy path. Simple and easy to reason about.
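The combination of a periodic full scan and an immediate targeted push can be sketched as follows. Everything here is a stand-in: plain dictionaries play the role of the Aurora database and the per-region KV stores, the `ContextPublisher` class and key format are invented for illustration, and deletions are not handled.

```python
class ContextPublisher:
    def __init__(self, source: dict, spoke_stores: dict):
        self.source = source              # stand-in for Aurora: cluster_id -> row
        self.spoke_stores = spoke_stores  # region -> KV store (a dict here)

    def publish(self, cluster_id: str) -> None:
        """Build one cluster context and push it to its spoke region."""
        row = self.source[cluster_id]
        context = {
            "cluster_id": cluster_id,
            "name": row["name"],
            "tenant_id": row["tenant_id"],
        }
        self.spoke_stores[row["region"]][f"cluster/{cluster_id}"] = context

    def publish_all(self) -> None:
        """Periodic full scan: repairs anything the fast path missed."""
        for cluster_id in self.source:
            self.publish(cluster_id)


def create_cluster(publisher: ContextPublisher, cluster_id: str, row: dict) -> None:
    publisher.source[cluster_id] = row
    try:
        # Fast path: push this one context right away.
        publisher.publish(cluster_id)
    except Exception:
        # Safe to ignore: the next publish_all() run will catch up.
        pass


stores = {"eu-central-1": {}, "ap-southeast-1": {}}
publisher = ContextPublisher(source={}, spoke_stores=stores)

create_cluster(
    publisher, "c1", {"name": "prod", "tenant_id": "t1", "region": "eu-central-1"}
)
print(stores["eu-central-1"]["cluster/c1"]["name"])  # prod

# Simulate a code path that forgot the immediate push.
publisher.source["c2"] = {"name": "dev", "tenant_id": "t1", "region": "ap-southeast-1"}
publisher.publish_all()  # in production this would run on a timer
print("cluster/c2" in stores["ap-southeast-1"])  # True
```

Note that the spoke-side stores are only ever written to from the hub side; nothing in a spoke region reads from the source, which mirrors the push-only direction of the design.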

The primary downside of this approach is that it was a lot more work to implement. But we think it was worth it for a few reasons:

We truly have zero inter-regional dependencies in the critical path. The primary region pushes updates to the spoke regions proactively, but the spoke regions never query the primary region or create any external connections. In fact, the spoke regions aren’t even aware of the hub region in any meaningful way. This makes reasoning about availability, reliability, and failure modes easy. We know the failure of one region will never impact other regions because no region takes dependencies on another region, so it can’t have any impact by definition.
The context framework we created is broadly useful. For example, in the future we’ll use it to build out support for our own feature flagging system without taking on any additional external dependencies.

With this setup, we have been able to deploy our control plane in three additional regions around the world, and we are ready to add more depending on customers’ needs.

Secure by default: How WarpStream’s BYOC deployment model secures the most sensitive workloads

WarpStream — Mon, 10 Jun 2024 17:21:30 +0000

by Caleb Grillo

Fundamentals of BYOC

WarpStream’s Zero Disk Architecture

Typically, cloud data infrastructure products follow one of two deployment models:

Fully self-managed, where the customer purchases a software license and support but is ultimately responsible for deploying and managing the software themselves.
Fully hosted SaaS, where the vendor manages all the infrastructure in their own cloud environment and the customer simply receives an endpoint.

The Bring Your Own Cloud (BYOC) deployment model is a hybrid approach to cloud infrastructure that strikes a balance between these two extremes. Generally, it works like this: the software is split into two different components, a “data plane” (compute + storage) and a “control plane”. The control plane runs in the provider’s environment, and the data plane runs in the customer’s environment.

This deployment model has several benefits:

Data privacy: Because the data never leaves your environment, you have greater control over who has access to it and under what circumstances.
Data sovereignty: Data is always stored on resources that you control, so you don’t need to worry about data finding its way to geographical regions where it shouldn’t be.
Compliance: The data plane is deployed in the customer environment, so strict compliance requirements can be fulfilled, and all traffic can be audited.
Cost optimization: Because the infrastructure runs in your environment, you can control factors like instance types, networking configurations, and storage classes to optimize costs. You can also take advantage of committed use discounts, reserved instances, and savings plans to further optimize your costs. And perhaps most importantly for a networking-heavy system like Apache Kafka®, the combination of this deployment model and WarpStream’s Zero Disk Architecture eliminates virtually all networking fees, which can often account for more than 80% of the TCO of a traditional Kafka deployment.
Control: You have control over the infrastructure that you deploy the software on, so you can choose your own networking topology, instance types, security settings, and storage services that you use.

BYOC makes a lot of sense for mission-critical, data-intensive, and networking-heavy systems like Kafka where throughput is often measured in the hundreds or even thousands of MiBs per second. But historically, BYOC for Kafka has been limited to niche use cases because Kafka (and other equivalent systems) are so stateful and difficult to manage that remotely administering them is almost impossible.

The problem with BYOC for Kafka

Kafka and its derivatives have stateful architectures, with local disks that store partitions that need to be actively managed in order to prevent a variety of issues like: hot partitions, unbalanced storage, under-replicated topic-partitions, etc. This is why there are so many vendors offering a fully-managed Kafka solution, but relatively few that offer a BYOC variant. Managing Kafka in your own environment is difficult enough, but managing Kafka in someone else’s environment is even more challenging.

Since Kafka clusters need to be constantly managed, existing BYOC deployment models for Kafka require providing the vendor with high-level access to your environment so their personnel can keep the cluster healthy and mitigate incidents when they inevitably occur. The BYOC vendor often has the ability to manage a huge range of cloud infrastructure, including security policies and resources for your VPC, service accounts, subnetworks, IAM roles, firewall rules, and storage buckets.

But wouldn’t it be better if external access wasn’t required at all?

Zero Access BYOC, secure by default

WarpStream’s primary deployment model is BYOC, but it works a little bit differently from the rest. Unlike most BYOC deployment models, WarpStream was designed to operate with no access to the environment that the Agents and object storage are running in. The only requirement for running the WarpStream Agents is that they have permission to access an object storage bucket in which they can store data, and that they have the ability to establish an outbound connection to the WarpStream Cloud control plane. That’s it. No IAM roles or permissive security policies are required.

This is possible because WarpStream was designed from the ground up with a Zero Disk Architecture with not only full separation of compute and storage, but also separation of data from metadata. This architecture makes managing the WarpStream Agents trivial. The Agents are just stateless compute, so there are no leader elections, no partition rebalances, no disk resizing, and no manual operations required to keep the cluster healthy. WarpStream clusters can be seamlessly scaled in, out, up, or down, with virtually no effort, just like a traditional web server.

WarpStream leverages this Zero Disk Architecture to provide a very high level of service with very little external control by using a shared responsibility model that separates storage, compute, and metadata.

The cloud provider manages the storage, the customer manages the stateless compute (i.e., the WarpStream Agents), and WarpStream Cloud manages the metadata / consensus layer. This means that only metadata is transferred from your environment to WarpStream’s, and no raw data ever leaves your environment.

Of course, this zero-access BYOC deployment model does have a tradeoff: WarpStream users are responsible for managing their own (stateless) compute. Fortunately, this is the one thing that everyone running software in the cloud knows how to do: deploy and scale stateless containers! We also do our best to make this easy by providing infrastructure-as-code primitives like our Terraform Provider and Helm chart.

In exchange for assuming responsibility for deploying and managing the stateless Agents, WarpStream’s users get a deployment model that exposes them to far less risk of a data breach than any other cloud-native model. By design, there is no way for WarpStream Cloud to access your data, even if WarpStream’s cloud account was breached by a hostile actor, or WarpStream was compelled by a government agency.

In fact, WarpStream’s security model is so strong that we even have customers using it for their production workloads in AWS GovCloud regions. While no system can credibly claim to be 100% safe, WarpStream’s design lends itself to a stronger security posture than any of the BYOC products that came before.

Zero trust makes BYOC safe

WarpStream makes the BYOC model safer and more secure than any alternative. This was a deliberate design choice that was made possible by WarpStream’s Zero Disk Architecture. With a zero-trust BYOC model, our customers truly get the best of both worlds: an (almost) fully managed user experience, but with all of the cost and security benefits of running on their own infrastructure.

To learn more about WarpStream’s secure-by-default BYOC deployment model, contact us. Or, if you’re ready to get started, you can sign up and get up and running with WarpStream in just a few minutes. No credit card is required to get started, and your first $400 is on us.

Announcing Bento, the open-source fork of the project formerly known as Benthos

WarpStream — Fri, 31 May 2024 17:36:25 +0000

by Richard Artoul

tl;dr

Redpanda announced yesterday that they’ve acquired Benthos and immediately made sweeping changes to the project: commercially licensing some of the most important integrations, redirecting all Benthos sites to Redpanda sites, and rebranding the Discord community to Redpanda. So TL;DR — we are (reluctantly) forking the Benthos project, and maintaining it as a 100% free MIT-licensed open-source project, just like Benthos was before the acquisition.

Love at First Blob

I first discovered the Benthos project serendipitously. I was browsing the /r/apachekafka subreddit and someone mentioned that they were using Benthos — a lightweight stream processing framework written in Go — as a simpler and more lightweight alternative to Kafka Connect. I was immediately intrigued. Kafka Connect is one of those projects in the data streaming space that everyone uses and everyone hates. A simpler and more performant alternative written in Go, WarpStream’s native language, immediately piqued my interest.

I Googled the project and fell in love with it pretty much right away. Unfortunately benthos.dev now redirects to a Redpanda docs site, but if you’ve never seen the original Benthos docs before, do yourself a favor and check out the old site. The home page (to me at least) is perfect: crystal clear concise messaging with a clear explanation of the value proposition, but also cute and hilariously entertaining in the best way. If that doesn’t immediately make you a fan of the project’s primary author Ashley Jeffs, then check out his Benthos rap video:

After meeting Ashley in person at Kafka Summit London, we decided to bet on Benthos as the connect layer for WarpStream. We sponsored the project for the maximum amount allowed and (after asking and receiving Ashley’s explicit permission) embedded Benthos directly into the WarpStream Agents to make it easy to integrate WarpStream with other systems and perform common lightweight stream processing tasks without needing to run any extra infrastructure. Then, a few weeks later, we launched Managed Data Pipelines, which brings a fully-managed model for managing streaming pipelines end to end, using the Benthos framework that we had already embedded into the Agents’ single Go binary.

We know Ashley well and loved working with him. He’s done an incredible amount of work over the last 7 years to build and maintain Benthos. He personally made nearly 3,500 commits on the Benthos repo, built a strong community, and most importantly, wrote some amazing software. He’s earned every cent that Redpanda paid him for the acquisition, and we couldn’t be happier for him.

The Acquisition

When we heard that Redpanda was going to acquire Benthos, we thought they were going to continue developing the project the same way (and under the same license) that Ashley had for the last 7 years, and that they would incorporate the already-proprietary Benthos Studio into their product. Instead, in less than 12 hours they:

Changed the name of the project from Benthos to “Redpanda Connect”, and prohibited anyone from using the term “Benthos.” [1]
Posted messages in both the Benthos Discord server and the #benthos channel in the Gophers Slack community encouraging community members to migrate to Redpanda’s Slack community instead.
Rebranded the Benthos Discord server to “Redpanda”.
Moved the Benthos GitHub repo to Redpanda’s repo, and split it into two repos with two different licenses.
Started relicensing some of the most critical integrations and connectors as proprietary [2] under a completely different license, including some of the integrations that were written by open source contributors who were not involved in the acquisition.

In just a few hours, Redpanda took a 7 year old open source project with nearly 8,000 stars on GitHub, hundreds of contributors, and thousands of users and began transitioning it to a proprietary software model.

We don’t really care what the GitHub repository is called, or how the Discord server is branded. In fact, if you go to our docs, you’ll see Redpanda Console displayed prominently as an integration because it’s a great product that helps our users get the most out of WarpStream, and we’re not afraid to give credit where credit is due.

But we do care about the license change, and making sure that we’re not infringing on any trademarks (real, or imagined). And perhaps most importantly…Benthos rocks. People should be able to continue to use it freely without worrying about when the commercialization bell might toll.

Back to WarpStream, though: our unique BYOC deployment model means that our customers deploy our code and binaries into their environments. Some of the code in the core Redpanda Connect repo is still MIT-licensed, and we technically could have kept using some of it, but we couldn’t wait around to find out what the next change would be. We have to ensure that one of our most critical dependencies is being stewarded in a thoughtful and responsible manner. We also cannot, in good conscience, include any software dependencies containing mixed or muddled licensing that could be subject to change (again) at a moment’s notice. Our customers deserve more stability and predictability than that.

The Fork

When we started WarpStream, we didn’t see ourselves becoming the maintainers of a major open-source project. In fact, we explicitly decided to not make WarpStream “open core” or “source available” from Day 1 because we hated the perverse incentives of that business model: hack distribution under the umbrella of “open source” for zero to one, and then pull the rug later by gating features, changing licenses, or crippling the open source project after the fact once critical mass has been achieved.

I’ll say it explicitly: We really didn’t want to create a fork. But we think that this is the only responsible thing to do given what’s happened already in just a few hours since the acquisition was announced.

So, we’re forking Benthos. We’ll be maintaining our fork as a free, open source, 100% MIT licensed project, just like Benthos was before.

Benty the Bento Box is not happy about the fork.

We’re calling our fork Bento [3], an homage to the Benthos name but separate and distinct from what is now called Redpanda Connect. We were proud to sponsor the Benthos project when it was an independent open-source project, and now we’re looking forward to fostering a new project to carry on where Benthos left off.

We hope Bento becomes a place where the Benthos community can land and contribute to the project in the same spirit that the Benthos project once had — but if that doesn’t happen, that’s fine by us. We’ll keep maintaining it because our product, WarpStream Managed Data Pipelines, relies on it.

You can find the new Bento repository here, as well as a hosted version of the original docs in all their glory.

You might be thinking, “Wait a minute, isn’t WarpStream just another corporation? Why should I spend my time contributing to their project if they can just take my contributions at any time and commercialize them?” Bento is 100% MIT licensed and will stay that way forever. In addition, we want to move to a shared governance model, with other official maintainers, and create an independent structure. However, before we can do that, we first need to find some other maintainers!

So, if your company has any commercial interest in Benthos (even if you’re a competitor!) and is worried about the recent ownership and licensing changes, please let us know. We’d love to collaborate on contributions, bring you in as an official maintainer, create an independent GitHub organization with dedicated CI/Docker infrastructure, and establish a formal governance structure.

So please, join us: check out the new repository, docs website (new domain coming soon), or even just get in touch if you want to learn more about Bento or participate in stewarding the project.

P.S. We hate that Benty is an AI-generated mascot. If you’re a talented illustrator with some ideas, please reach out as well (we pay well!).

Footnotes

[1] We’re pretty sure this isn’t how copyrights, software licensing, and trademarks work (like, at all), but we also didn’t feel like arguing about it, or getting the lawyers involved.
[2] This relicensing was done with the justification that “all the users that are using those services are used to paying for the integration with those services.” This seemed to us like a clear signal of more potentially hostile things to come.
[3] Yes, this is the best we could come up with. If you’re good at naming things, we’re hiring our first Product Marketer!

Zero Disks is Better (for Kafka)

WarpStream — Tue, 28 May 2024 17:59:34 +0000

by Richard Artoul

Zero Disk Architectures

In our previous post, we discussed why tiered storage won’t fix Kafka and how it will actually make your Kafka workloads more unpredictable and more difficult to manage. The fundamental problem with tiered storage is that it only gets rid of some of the disks. Even if we minimized the amount of time that data was buffered on the local broker disks to 1 minute, all of the issues in our previous post would still remain.

Tiered storage is all about using fewer disks. But if you could rebuild streaming from the ground up for the cloud, you could achieve something a lot better than fewer disks: zero disks. As we’ll demonstrate in the rest of this post (using WarpStream as a concrete example), the difference between some disks and zero disks is night and day. Zero Disk Architectures (ZDAs), with everything running directly through object storage and no intermediary disks, are better if you can tolerate a little extra latency. Much better.

Don’t just take it from me, though. Less than one year after our initial product launch and announcement that “Kafka is Dead, Long Live Kafka”, almost every other vendor on the market (Confluent included) has announced their plans to follow suit and retrofit their existing architectures over the next few years. With Confluent effectively abandoning the official open-source project in favor of their proprietary Kafka-compatible cloud product, it’s safe to say that Apache Kafka is well and truly dying, and what will live on is the protocol itself.

That said, I think that the industry conversations around Zero Disk Architectures have missed the forest for the trees by focusing exclusively on reducing costs. Don’t get me wrong, reducing the cost of using Kafka by an order of magnitude is a huge deal that cannot be overstated, but it’s also only one small part of a much broader story.

In the rest of this post, I will do my best to tell the rest of the story and explain how WarpStream’s Zero Disk Architecture enables developers to do so much more than they ever could before, not just because it reduces costs by an order of magnitude, but because the architecture itself enables radically new functionality and deployment strategies that were previously impossible.

Specifically, I’ll cover three different topics that have been left out of the conversation so far and explain how Zero Disk Architectures:

Are radically simpler and eliminate huge amounts of complexity and operational burden.
Enable completely novel functionality that was previously impossible.
Heavily tilt the scales in the SaaS vs. BYOC debate in favor of a completely new “Zero Access” BYOC model, at least for Kafka and the data streaming space.

Maximum Simplicity

Stateless compute enables elastic scaling

Let’s start with the basics. The most obvious benefit of WarpStream’s Zero Disk Architecture is auto-scaling. Since the “brokers” (Agents in WarpStream’s case) are completely stateless with zero local disks, they can be trivially auto-scaled in the same way that a traditional stateless web application can: add containers when CPU usage is high, and take them away when it’s low. No custom tooling, scripts, or Kubernetes operator required. In fact, when deployed in Kubernetes, WarpStream Agents are deployed using a Deployment resource just like a traditional stateless web server, not a StatefulSet.
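As a rough illustration of what "add containers when CPU usage is high, and take them away when it's low" means in practice, here is a minimal Python sketch of a proportional scaling rule. It has the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler; the function name, the 60% target, and the min/max bounds are assumptions chosen for the example, not WarpStream settings.

```python
import math


def desired_agent_count(current_agents, avg_cpu_percent, target_cpu_percent=60,
                        min_agents=1, max_agents=100):
    """Return how many stateless Agents to run for the observed CPU usage.

    Proportional rule: scale the current count by observed / target
    utilization, round up, then clamp to the allowed range.
    """
    if current_agents < 1:
        raise ValueError("current_agents must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    raw = math.ceil(current_agents * avg_cpu_percent / target_cpu_percent)
    return max(min_agents, min(max_agents, raw))


# Scale out when the fleet runs hot, scale in when it is mostly idle.
scale_out = desired_agent_count(current_agents=4, avg_cpu_percent=90)
scale_in = desired_agent_count(current_agents=6, avg_cpu_percent=20)
```

Because the Agents hold no local state, applying the result is just a matter of changing the replica count; there is no data to move before a container can be added or removed.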

WarpStream cluster auto-scaling in ECS

The graph above shows a real WarpStream cluster scaling automatically based on CPU usage. Think about just how many moving parts would be required to pull that off with a stateful system like Apache Kafka that has any local disks or EBS volumes, and by contrast how it just falls out of the architecture, effectively for free, with WarpStream.

This operational simplicity completely changes how developers run and use the software. The number of users of existing data streaming products who can actually take advantage of anything remotely resembling actual auto-scaling is infinitesimally small. By contrast, almost every WarpStream BYOC customer is leveraging auto-scaling. It’s so easy that it becomes the default.

Isolate workloads with Agent Groups

Another benefit of WarpStream’s simple architecture is that since no Agent is “special” and any Agent can handle writes or reads for any topic-partition, massively scaling out writes or reads at a moment’s notice is feasible, just like with a traditional data lake.

If you’re worried that your massive data lake queries will interfere with your production workloads, you can just deploy an entirely new “group” of dedicated nodes whose only job is to act as thick proxies/caches for the data lake workloads while a completely different (and isolated) set of nodes handles the transactional workloads.

This approach brings the promise of HTAP to the data streaming space, with both transactional and analytical workloads completely isolated from each other but operating on the exact same dataset and with zero replication delay. No ETL is required.

WarpStream calls this feature Agent Groups and it underpins a deeper insight about Zero Disk Architectures: they enable radically flexible topologies and deployment models. For example, in addition to using Agent Groups to isolate transactional workloads from analytical ones, you can also use this feature to completely isolate producers from consumers:

Agent Groups can also be used to sidestep networking barriers entirely by deploying different groups of Agents into different VPCs, cloud accounts, or even regions and using the object storage bucket as the shared network layer:

This makes it trivial to create single logical clusters that span traditional cloud networking boundaries in an easy, simple, and cost-effective manner. While this may seem “boring”, in practice, many modern organizations need their data to be available across multiple “environments” and without the ability to leverage the object storage layer as a network, they have to resort to a complex, brittle, and expensive mess of solutions involving VPC peering, NAT gateways, load balancers, private links, and other exotic cloud networking products. With WarpStream, traditional cloud networking can be bypassed entirely in favor of using the object store itself as the network.

I know what you’re thinking at this point: “Come ON Richie, I thought this was going to be a cool article about some brand new whiz bang tech. Cloud networking!? Really? Who cares!?” And I get it. Cloud networking is mind-numbingly boring. But it is also really important, especially for Kafka, because Kafka is the beating heart of many organizations’ tech stack. It powers their internal observability pipelines, enables different teams to share data with each other, serves as the source of truth write ahead log for internal databases, enables CDC from operational datastores to analytical ones, and so much more. At its core, Kafka decouples the producer of a specific piece of data from its (often many) different consumers. None of that is possible if those producers and consumers are separated by an impermeable (technically or financially) network boundary.

Integrate All the Things with Managed Data Pipelines

Another great example of how Zero Disk Architectures enable completely novel functionality is that since the WarpStream Agents are completely stateless, they can be made significantly more feature-rich than traditional Kafka brokers. For example, imagine you’re using WarpStream to ingest application and AI/ML inference logs into an external system like ClickHouse. Another team requests that the data be made available as parquet files in object storage so that they can interact with it using a variety of different tools, and also because the security/compliance teams want a historical archive outside of Kafka.

With WarpStream’s Managed Data Pipelines, all you have to do to enable this functionality is click a few buttons in the WarpStream UI and paste in this configuration file:

input:
    kafka_franz_warpstream:
        topics:
            - logs
output:
    aws_s3:
        batching:
            byte_size: 32000000
            count: 0
            period: 5s
            processors:
                - mutation: |
                    root.value = content().string()
                    root.key = @kafka_key
                    root.kafka_topic = @kafka_topic
                    root.kafka_partition = @kafka_partition
                    root.kafka_offset = @kafka_offset
                - parquet_encode:
                    default_compression: zstd
                    default_encoding: PLAIN
                    schema:
                        - name: kafka_topic
                          type: BYTE_ARRAY
                        - name: kafka_partition
                          type: INT64
                        - name: kafka_offset
                          type: INT64
                        - name: key
                          type: BYTE_ARRAY
                        - name: value
                          type: BYTE_ARRAY
        bucket: $YOUR_S3_BUCKET
        path: parquet_logs/${! timestamp_unix() }-${! uuid_v4() }.parquet
        region: $YOUR_S3_REGION
warpstream:
    cluster_concurrency_target: 6

The WarpStream Agents will now automatically consume the logs topic and generate the parquet files in S3. No custom code or additional services are required, and the entire data pipeline will run in your cloud account, on your VMs, and store data in your object storage buckets.
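To make the batching block in the configuration above more concrete, here is a minimal Python sketch of how such a policy decides when to flush a batch to S3. It assumes the usual Benthos-style semantics, where a batch is emitted as soon as any enabled threshold is reached and a value of 0 disables that threshold. The class and method names are hypothetical, and the exact timer behavior of the real framework may differ.

```python
from dataclasses import dataclass


@dataclass
class BatchPolicy:
    """Mirrors the byte_size / count / period fields from the config above."""
    byte_size: int = 32_000_000    # flush once roughly 32 MB is buffered
    count: int = 0                 # 0 means "no message-count limit"
    period_seconds: float = 5.0    # flush at least every 5 seconds

    def should_flush(self, buffered_bytes, buffered_count, seconds_since_first):
        # Never emit an empty batch.
        if buffered_count == 0:
            return False
        # Any single enabled threshold is enough to trigger a flush.
        if self.count > 0 and buffered_count >= self.count:
            return True
        if self.byte_size > 0 and buffered_bytes >= self.byte_size:
            return True
        if self.period_seconds > 0 and seconds_since_first >= self.period_seconds:
            return True
        return False


policy = BatchPolicy()
busy_topic = policy.should_flush(32_000_000, 40_000, 1.2)   # size threshold hit
quiet_topic = policy.should_flush(2_048, 3, 5.0)            # time threshold hit
```

With these settings, a busy topic produces Parquet files of roughly 32 MB, while a quiet topic still gets written out every few seconds, so downstream readers are never far behind.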

Now, if you’re an SRE or engineer who has ever been on-call for Apache Kafka, you might be feeling extremely uncomfortable right now. “Won’t that increase CPU usage on the Agents!?” The answer to that question is “Yes,” but also “Who cares?” The CPU auto-scaler will add more Agents if necessary without any manual intervention, and you can leverage WarpStream’s Agent Roles and Agent Groups functionality to run the data pipelines on isolated pools of Agents for larger workloads.

Is this the most efficient way to perform this task at a massive scale? Definitely not. Is it a reasonable and extremely cost-effective solution for 99% of use-cases? Absolutely. Unlike Kafka brokers, WarpStream Agents are cattle, not pets, and can be treated as such.

Zero Access and BYOC-native

The SaaS vs. BYOC debate has raged in the data streaming industry for a long time, and for good reason. To recap briefly: the BYOC deployment model is unbeatable in terms of unit economics and data sovereignty, which makes it very appealing for the highest scale workloads.

However, legacy BYOC models come with a lot of trade-offs as well. For example, legacy BYOC doesn’t necessarily present as a “fully managed” service in the way that most customers would expect from a, well, “managed” service. Historically most of the BYOC solutions on the market were just repackaged versions of the same stateful datacenter software that end-users have been struggling with for almost a decade now. Having that stateful datacenter software managed inside the customer’s cloud account / VPC by the vendor certainly helps, but it also doesn’t eliminate all of the fundamental problems associated with running that software in the cloud in the first place.

In addition, to make it all work, the customer has to grant the vendor high-level cross-account IAM access and privileges so that when something inevitably goes wrong with one of the stateful components, the vendor can tunnel into the customer’s environment and manually fix it. This works, but it represents a huge security risk and liability for the customer, since the vendor (and all of their support staff) effectively have root access to the customer’s account and data.

It’s easy to see why BYOC architectures occupy a unique niche in the space. They can dramatically reduce costs, but they also come with a lot of serious tradeoffs and risks. However, the emergence of Zero Disk Architectures changes the calculus significantly.

Specifically, since all the tricky problems related to storage (durability, availability, scalability, etc.) can be offloaded to the cloud provider’s object store implementation, it’s now feasible for the customer to manage all of the (now stateless) compute / data plane on their own, with zero cross-account IAM access or privileges granted to the vendor. This makes it impossible for the vendor to access the customer’s production environment or data.

Of course, storage isn’t the only tricky part about running a data streaming system. The other part that can be really hard to manage is the metadata / consensus layer. That’s why WarpStream was designed from Day 1 with a “BYOC-native” architecture that not only separates compute from storage, but also data from metadata. This creates a shared responsibility model that looks something like this:

Storage scaling is handled by the cloud provider, metadata scaling is handled by WarpStream Cloud, and compute scaling is trivially managed by the customer by auto-scaling on CPU usage. We think that’s a game changer because the end-user is responsible for only one thing: providing compute, and that’s the one thing that everyone running software in the cloud knows how to do: schedule and run stateless containers.

When a customer deploys WarpStream into their environment, they don’t grant WarpStream Cloud any permissions or give WarpStream engineers the ability to SSH into their environment in an emergency. Instead, they just deploy the single WarpStream Agent Docker image into their environment, however they prefer to deploy containers. Everything “just works” as long as that container has access to an object storage bucket and can reach WarpStream’s control plane to handle metadata operations. There’s really nothing more to it. In fact, this model is so secure that we even have customers leveraging this model in GovCloud environments!

In summary, this new “Zero Access BYOC-native” approach strengthens all of the core value propositions of existing BYOC deployment models (low costs, full data sovereignty), while also eliminating almost all of their drawbacks (remotely administering/scaling stateful storage services, security holes).

Zero Disk Architectures are Changing Everything (about Kafka)

I’ll conclude with this: Zero Disk Architectures are going to transform the entire data streaming space as we know it. Everything from pricing to capabilities and even deployment models is going to be flipped entirely on its head. But right now, WarpStream is the only Zero Disk Architecture streaming system on the market that you can actually buy and use today.

If you’re an existing Kafka user who’s struggling with operations, costs, or lack of flexibility, there’s never been a better time to try something new. WarpStream’s BYOC product is cheaper to run than self-hosting Apache Kafka, and its Zero Disk Architecture means you’ll never have to deal with partition rebalancing, replacing nodes, broker imbalances, full disks, ZooKeeper/KRaft, or fussy cloud networking products.

Click here to get started for free. No credit card is required, and your first $400 in platform fees is on us.

Pixel Federation Powers Mobile Analytics Platform with WarpStream, saves 83% over MSK

WarpStream — Wed, 22 May 2024 15:31:37 +0000

by Caleb Grillo

Case Study

Pixel Federation is the developer of nearly a dozen highly popular mobile games with players from all over the world. They have millions of monthly active users, and those millions of users generate lots of events. In fact, Pixel Federation uses an event-driven architecture for almost everything: logging, events, billing, tracking game state, etc.

TrainStation2 by Pixel Federation

Like many other companies, Pixel Federation initially chose Apache Kafka as the message bus to power all of its real-time data streaming infrastructure. Instead of running open-source Kafka themselves, they started with AWS’s managed Kafka offering: MSK.

Initially, things worked great: developers found that instrumenting their applications to emit new events to Kafka was easy, and once other teams at the company realized how easy it was to tap into the flow of real-time data, they started consuming the data as well.

Before they knew it, Pixel Federation’s Kafka cluster had thousands of different topics, more than forty different consumer applications, and was being accessed by Kafka client libraries in four different languages. It’s no exaggeration to say that Kafka was the beating heart of Pixel Federation’s data infrastructure.

Unfortunately, this is also when they started to run into problems with their MSK setup. The first problem was that their bill was growing much faster than their actual data volumes because they had so many different topics. MSK requires that Kafka brokers are upgraded to larger and larger VMs as the number of topic-partitions increases, even if data volumes remain flat.

The second issue, besides cost, is that like many organizations, Pixel Federation has a complex production environment with different VPCs and AWS accounts. This works great for isolating teams, enforcing security boundaries, and minimizing blast radiuses, but sometimes data in Kafka needs to be shared across network boundaries. For example, Pixel Federation’s game servers run in a completely different AWS account / VPC than their Flink consumers.

This meant that they had to peer their VPCs so that the MSK cluster in VPC1 could be connected to VPC2. If you’ve ever had to set up VPC peering before, you know just how difficult and burdensome it can be. MSK does offer an alternative using their multi-VPC private connectivity feature, but it adds an extra $0.006 / GiB of data transferred. In addition, Pixel Federation had to pay for inter-zone networking for all the traffic between their producers and the MSK brokers, as well as for the traffic between MSK and their consumers. Their average read amplification was 4x, so this resulted in a lot of inter-zone networking fees.

When they migrated to WarpStream, Pixel Federation took advantage of WarpStream’s Agent Groups functionality to deploy a much more cost-effective architecture instead:

They run a group of Agents in the AWS account / VPC that contains their game servers (the data producers) and those Agents write data directly to an object storage bucket that is shared across both of their AWS accounts. In the second AWS account / VPC, they run a second group of Agents that can consume the data written in the other account via the shared object store. In effect, they use a shared object storage bucket as both the storage layer and the networking layer to flex a single logical “Kafka” cluster across two different AWS accounts / VPCs.
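The data flow described above can be pictured with a toy in-memory model. This is only a sketch: the class names, group names, and topic are hypothetical, and it leaves out the control plane that real Agents rely on for metadata. It is not WarpStream's implementation.

```python
from collections import defaultdict


class SharedBucket:
    """Stand-in for an object storage bucket reachable from both accounts."""

    def __init__(self):
        self._objects = defaultdict(list)  # topic -> list of records

    def append(self, topic, record):
        self._objects[topic].append(record)
        return len(self._objects[topic]) - 1  # position of the new record

    def read(self, topic, offset):
        return self._objects[topic][offset:]


class AgentGroup:
    """A group of stateless Agents running in one account / VPC (toy model)."""

    def __init__(self, name, bucket):
        self.name = name
        self.bucket = bucket

    def produce(self, topic, record):
        return self.bucket.append(topic, record)

    def fetch(self, topic, offset=0):
        return self.bucket.read(topic, offset)


bucket = SharedBucket()
producer_group = AgentGroup("game-servers-vpc", bucket)  # hypothetical name
consumer_group = AgentGroup("flink-vpc", bucket)         # hypothetical name

producer_group.produce("player-events", b"level_up")
producer_group.produce("player-events", b"purchase")

# The second group never connects to the first; it only reads the shared bucket.
records = consumer_group.fetch("player-events")
print(records)
```

The point of the model is that the two groups share nothing except the bucket, which is why no VPC peering is needed between them.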

This architecture is significantly more cost-effective than their previous MSK solution because they don’t have to pay for any EBS volumes or networking fees. In fact, before adopting WarpStream, Pixel Federation was spending more than $60,000/year on AWS MSK. By comparison, their total cost of ownership with WarpStream is less than $10,000/year, a 6x savings. That is on top of all the additional benefits they got with the migration, like the ability to use WarpStream Agents to flex their cluster across multiple VPCs, seamless auto-scaling, and no more manual partition rebalancing to keep their brokers evenly loaded.
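As a quick check, the 6x figure and the 83% in the headline follow from the same two numbers. The sketch below uses the quoted bounds directly (more than $60,000 before, less than $10,000 after), so the real savings are at least this large.

```python
# Annual figures quoted in the case study, in USD. Both are bounds, so the
# results below are lower bounds on the actual savings.
msk_annual_cost = 60_000
warpstream_annual_cost = 10_000

savings_multiple = msk_annual_cost / warpstream_annual_cost
savings_percent = (1 - warpstream_annual_cost / msk_annual_cost) * 100

print(f"{savings_multiple:.0f}x cheaper, {savings_percent:.0f}% saved")
# prints "6x cheaper, 83% saved"
```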

Adam Hamsik is the CEO and co-founder of Labyrinth Labs, an AWS partner that has been working with Pixel Federation for years, helping them manage their cloud infrastructure. He had this to say:

“We have been using Kafka in our application infrastructure for years, and I really liked its scalability and versatility, but in cloud environments, the cost of managed Kafka clusters can be quite significant. As good engineers, we are always looking for the newest innovation that can save us AWS costs. Working with WarpStream Labs was an absolute pleasure. They went above and beyond anyone else we have ever worked with and tuned their application to our needs.” — Adam Hamsik, CEO of Labyrinth Labs

Get Started

If you’re ready to save money and reduce your operational burden, you can sign up for WarpStream and get started in just a few minutes. New signups get $400 in free credit with no expiration, and no credit card is required to get started.

WarpStream Newsletter #3: Always Be Shipping

WarpStream — Fri, 17 May 2024 14:36:58 +0000

Welcome to the third issue of the WarpStream Newsletter!

We have added a ton of new features since our last newsletter, and we’re excited to share them all with you in this update. We also started a YouTube channel that we’ll be using to share video content. Subscribe to our channel to stay up to date on conference talks, informational videos, interviews, and other content.

Also, a few of us are in Bangalore for Kafka Summit this week. If you are attending Kafka Summit Bangalore, drop by our booth and say hello!

Join us on X, Slack, and Discord!

What’s new

Blog post: Tiered Storage Won’t Fix Kafka

The tiered storage architecture has been proposed as a solution to many of Kafka’s problems. Unfortunately, it doesn’t really live up to the hype.

We released a blog post this week discussing the fundamental flaws with tiered storage for Kafka. In this post, we argue in favor of a disk-less architecture, not a bolt-on solution that causes at least as many problems as it aims to solve. Check it out, and let us know what you think.

Blog post: Cloud Disks are (Really!) Expensive

It’s a fact: disks are expensive in the cloud. But many Kafka users underestimate just how expensive the disks actually are. So we wrote a blog post that breaks down the cost of the local storage for a Kafka cluster.

Support for multiple control plane regions

You can now configure your BYOC clusters in WarpStream to communicate with an instance of the WarpStream control plane in three additional globally distributed regions in AWS. The supported regions are now:

us-east-1
us-west-2
eu-central-1
ap-southeast-1

To reduce round-trip request latency, configure your WarpStream Agents to communicate with the control plane region that is geographically closest to where your Agents are deployed. Review the docs to learn more.

We also now support deploying Serverless clusters in both us-east-1 and ap-southeast-1.

Setting up a new control plane region is straightforward, so if you’re interested in another region, let us know and we’ll deploy the control plane there.

Add your team

You can now invite your team members to your WarpStream account so you can collaborate in the same workspace. Navigate to the Teams page in the console to invite your team.

Support for mTLS

WarpStream now supports mTLS between your clients and WarpStream agents. To learn more about how to configure this feature, review the docs.
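On the client side, mTLS generally means pointing your Kafka client at a CA certificate plus a client certificate and key. The sketch below builds a configuration dictionary using librdkafka-style property names. The helper name, Agent address, and file paths are hypothetical, and the Agent-side settings are covered in the docs rather than here.

```python
def mtls_client_config(bootstrap_servers, ca_path, cert_path, key_path):
    """Build a librdkafka-style config for a client that presents a certificate."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SSL",
        "ssl.ca.location": ca_path,             # CA used to verify the Agent's certificate
        "ssl.certificate.location": cert_path,  # client certificate presented to the Agent
        "ssl.key.location": key_path,           # private key for the client certificate
    }


config = mtls_client_config(
    "warpstream-agent.internal:9092",  # hypothetical Agent address
    "/etc/certs/ca.pem",               # hypothetical paths
    "/etc/certs/client.pem",
    "/etc/certs/client.key",
)
print(config)
```

A dictionary like this can be passed to a librdkafka-based client, for example `confluent_kafka.Producer(config)`.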

Let us know if you have any questions!

Benthos integration (beta)

The WarpStream Agent now has built-in support for Benthos, an open-source framework for streaming transformations and integrations. You can use Benthos to integrate WarpStream with other systems, such as databases and other messaging systems, using a simple, easy-to-use configuration.

You can read more about WarpStream’s built-in Benthos support on our blog. Or, check out our docs to learn how to configure Benthos on WarpStream. This feature is currently in beta, so feel free to try it out and give us feedback.

Interested in learning more about WarpStream?

Book a call

What’s next

Kafka Transactions

The team is currently working on support for Kafka Transactions. This is the last remaining major Kafka protocol feature missing from WarpStream. We expect to release support for Transactions in the next few weeks.

Transactions are often used in stream processing workloads, so we are excited to be able to onboard those use cases! If you’re interested in learning more about our work in this area, contact us and we would be more than happy to tell you about it.

Built-in offset preserving replication

One of the biggest hurdles to migrating to WarpStream is transparently moving clients from a Kafka cluster to a WarpStream cluster. We are currently working on a solution that will provide replication for both data and metadata, so you can directly mirror a Kafka topic (or topics) from any Kafka (or Kafka-compatible) cluster to your WarpStream cluster. Unlike existing replication solutions, such as MirrorMaker, WarpStream’s built-in replication will also preserve offsets, so you will be able to transparently switch consumer clients from your Kafka cluster to WarpStream.
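To see why preserved offsets matter, consider a toy model where each log is a dictionary from offset to record. The names and values are made up for illustration, and this is neither WarpStream's replication nor a faithful model of any particular tool. It only shows what happens to a committed consumer offset when records are re-produced into a fresh topic.

```python
# Source topic: retention has already removed offsets 0-999, so the log starts at 1000.
source_log = {1000 + i: f"event-{i}" for i in range(6)}
committed_offset = 1003  # next record the consumer group expects to read

# Re-producing the records into a new topic assigns fresh offsets starting at 0.
reproduced_log = {new: rec for new, rec in enumerate(source_log.values())}

# An offset-preserving copy keeps each record at its original offset.
preserved_log = dict(source_log)

print(source_log[committed_offset])          # event-3
print(reproduced_log.get(committed_offset))  # None: the committed offset points at nothing
print(preserved_log[committed_offset])       # event-3
```

With preserved offsets, a consumer group can resume from its existing committed offset on the new cluster without any translation step.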

Schema Validation and Schema Registry

We are currently working on building full schema validation into the Agent. This feature will work with an external schema registry. We will then follow up with our own implementation of a WarpStream schema registry, so soon you won’t have to run an external system to ensure that your data conforms to the expected schema.
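For context, validation against an external registry usually begins by finding out which schema a record claims to use. The sketch below assumes records are framed with the Confluent Schema Registry wire format (a zero magic byte followed by a 4-byte big-endian schema ID). That framing is an assumption here, and the function name is hypothetical.

```python
import struct


def extract_schema_id(record_value: bytes) -> int:
    """Return the schema ID from a record framed as: 0x00 + 4-byte big-endian ID + payload."""
    if len(record_value) < 5 or record_value[0] != 0:
        raise ValueError("record is not framed with a schema ID")
    (schema_id,) = struct.unpack(">I", record_value[1:5])
    return schema_id


# Build a framed record by hand: magic byte, schema ID 42, then the encoded payload.
framed = b"\x00" + struct.pack(">I", 42) + b"payload"
print(extract_schema_id(framed))  # 42
```

Checking that the payload actually conforms to the schema is a separate step that depends on the serialization format.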

Sign up and try WarpStream for free!

WarpStream is free to try. Sign up now and discover how you can save 80% on your total cost of running Kafka.

Introducing WarpStream Managed Data Pipelines for BYOC Clusters

WarpStream — Tue, 14 May 2024 14:12:06 +0000

Stream processing made even more operationally mundane

We previously launched embedded support for Benthos in the WarpStream Agent, which enables WarpStream users to use Benthos without managing any additional infrastructure.

Today, we’re excited to announce that we have taken this feature a step further with WarpStream Managed Data Pipelines. A new feature for WarpStream BYOC clusters, Managed Data Pipelines provide a fully-managed SaaS user experience for Benthos, without sacrificing any of the cost benefits, data sovereignty, or deployment flexibility of the BYOC deployment model.

Deploy multiple Pipelines from the WarpStream console

Benthos is a lightweight stream processing framework that offers much of the functionality of Kafka Connect, as well as additional stream processing functionality like single message transforms, aggregations, multiplexing, enrichments, and more. It also has native support for WebAssembly (WASM) for more advanced processing.

Previously, WarpStream was just a replacement for Apache Kafka®, but with Managed Data Pipelines WarpStream can do a lot more out of the box. For example, the WarpStream Agents can now directly connect with external systems, stream data between topics, perform on-the-fly data transformation, stream data into downstream systems, and much more, all with a simple YAML configuration.
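To give a feel for the shape of such a configuration, here is a sketch of a generic Benthos pipeline that reads from one topic, transforms each message, and writes to another. It is built as a Python dictionary and printed as JSON, which mirrors the YAML structure one-to-one. The broker address, topics, consumer group, and field name are hypothetical, and Managed Data Pipelines may use WarpStream-specific settings that are not shown, so check the docs for those.

```python
import json

# Hypothetical broker address, topics, and consumer group.
pipeline_config = {
    "input": {
        "kafka_franz": {
            "seed_brokers": ["localhost:9092"],
            "topics": ["raw-events"],
            "consumer_group": "events-normalizer",
        }
    },
    "pipeline": {
        "processors": [
            # Bloblang mapping: copy the message and upper-case one field.
            {"mapping": "root = this\nroot.event_type = this.event_type.uppercase()"}
        ]
    },
    "output": {
        "kafka_franz": {
            "seed_brokers": ["localhost:9092"],
            "topic": "normalized-events",
        }
    },
}

print(json.dumps(pipeline_config, indent=2))
```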

In addition to all the native features of Benthos, WarpStream Managed Data Pipelines also provide:

A user interface in the WarpStream Console for creating and editing pipelines.
The ability to pause and resume pipelines on demand.
Version control and branching, so you can easily deploy changes or roll back configurations as needed.
Automatic handling of SASL authentication, ACLs, AZ-aware routing, and many other WarpStream-native features.
Control over concurrency, with distribution managed by WarpStream and controlled with a single line in your configuration.

Control versioning and deployment from within the WarpStream console

Managed Data Pipelines is the natural evolution of WarpStream’s BYOC product. In the same vein that WarpStream is a novel implementation of an open protocol with no vendor lock-in, Managed Data Pipelines is a cloud-native, BYOC-managed version of a popular open-source stream processing framework, enhanced with WarpStream’s signature data plane/control plane split. Pipeline configuration, version control, clustering, and pipeline deployment are all administered remotely using a SaaS UX, but the actual pipelines run in your cloud account, using your compute resources and your object storage buckets. Raw data never leaves your account.

To learn more about how to use Managed Data Pipelines, check out the docs, and if you have any questions, feel free to join our Slack community. You can get started with WarpStream for free, with no credit card required, and start streaming with Managed Data Pipelines in just a few minutes.