🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Operating Apache Kafka and Apache Flink at scale (ANT307)
In this video, Ashish Palekar and Sai Maddali from AWS discuss operating Apache Kafka and Flink at scale, covering Amazon MSK and Managed Service for Apache Flink. They explain MSK's three offerings—standard brokers, Express brokers, and Serverless—emphasizing Express brokers' advantages including eliminated storage management, 90% faster failure recovery, and 20x more elasticity. Key topics include storage scaling challenges, failure handling with proper buffers, and the new intelligent rebalancing feature delivering 180x faster rebalancing. For Flink, they detail architectural concepts like event time, windowing, and checkpointing, plus operational best practices around monitoring job availability, state hygiene, avoiding data skew, and proper serialization. They demonstrate how their managed services reduce mean time to recovery through blue-green deployments, warm pools, and automatic healing systems that classify and surgically fix issues.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Operating and Scaling Managed Apache Kafka and Flink at AWS re:Invent 2025
Welcome to Vegas. Welcome to re:Invent 2025. You are in ANT 307, Operating and Scaling Managed Apache Kafka and Flink. My name is Ashish Palekar. I take care of our Kafka, Flink, and Firehose services here at AWS, which means that I spend time talking to customers, building the service, and operating the service. With me is Sai Maddali, who leads the product teams for our Amazon Managed Streaming for Apache Kafka, Amazon Managed Service for Apache Flink, and Amazon Data Firehose.
Today we are going to cover a bunch of what we are doing with our Kafka and Flink services. We're going to walk you through our learnings and compare and contrast, give you a sense of what we've learned, and hopefully we can cover a lot of details in the session. A lot of customers often start with why streaming data and why now. It really boils down to four things that we hear from customers. It's about unlocking real value from your data in real time as it flows, which helps you make decisions in real time and shorten the time to actually get to insights. You can get continuous intelligence from that data. And last but not least, as we are continually seeing now with AI workloads, you need more contextualized data. You need more fresh data. Streaming technologies help you get that.
Here at AWS, we love Kafka and Flink. We have had Kafka and Flink services for well over five years at this point and have been operating workloads for customers. We have customers who also use self-managed Kafka and Flink in addition to ours. And there's certainly other partner solutions as well. But really, we see a large and wide variety of workloads across the board.
One of the things I get asked by customers is what outcomes customers look for when running at scale. What do customers really care about? It really boils down to price performance, reliability, security, and performance. These seem like basics, but at scale they are magnified. Every one of you operating a Kafka or Flink service knows that. Just by show of hands, how many of you operate a Kafka service? Vast majority of you. Awesome. And, to help Sai later on, how many of you use Flink? A few of you. We'll talk about that too. Thank you for being our customers.
Customer Success with Nexthink and the Spectrum of Amazon MSK Offerings
These are some other customers. Customers make our world go round, and this is absolutely super helpful for us to understand what customers are doing. Let's talk about running Kafka at scale. I always love to start with a customer example. Nexthink is a customer based out of Switzerland, really focused on how to build great developer experiences. Their core challenge was scaling Kafka for their business. They switched to Amazon MSK from on-premises and can now process trillions of events a day, reaching five gigabytes per second of aggregated throughput. It has been compelling to watch their journey in scaling from 200 megabytes per second to five gigabytes per second. That has taught them, as well as us, a whole bunch of lessons as we observe workloads across our customers.
When we think about the Kafka services we offer, we think of them as a spectrum: on the left-hand side, more Kafka control, and on the right-hand side, more Kafka automation. On the more-control end, we offer standard brokers for Amazon MSK. That fits when you're migrating from an existing Kafka setup, need fine-grained control over Kafka, and have deep knowledge of how Kafka works, because that becomes important.
On the other end of the spectrum is Amazon MSK Serverless, where you can quickly deploy and scale Kafka without needing to manage it or care about the nuances of running it. In between is Express brokers for Amazon MSK, which we launched last November, for customers who value performance and elasticity and want or need less low-level control over Kafka. Typically, this is the set of customers managing Kafka at scale.
This is something I provide as guidance to customers when I talk to them. Unless you know differently, I always recommend you start with Express brokers. It's a good place to start for your workloads, and then you can decide where to go from there. Across this family of products, we see a diversity of Kafka workloads, and it gives us really interesting insights. We see customers operating at small scale and large scale. We see customers operating with high partition counts, low partition counts, and combinations of both. Each of these entails different considerations.
Storage Management Challenges and How Express Brokers Eliminate Them
One of the first areas I'll discuss today is storage management. This starts to get tricky, especially at scale. Specifically, storage scaling takes time. You can go from four terabytes to eight terabytes to twelve terabytes, and all of this is possible. One of the things we consistently see is customers who are not monitoring their storage utilization. Your storage utilization can spike as workloads change. If you haven't already, there is something Amazon MSK offers that you should subscribe to: disk full alerts. You get alerts at sixty percent full, at eighty percent full, and then when it's actually full. This will alert you to ensure that your capacity is actually taken care of.
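For standard brokers, it is worth wiring those disk alerts into your own alarming as well. Below is a minimal sketch, assuming the AWS SDK for Java v2 and MSK's per-broker KafkaDataLogsDiskUsed CloudWatch metric; the cluster name, broker ID, and SNS topic ARN are placeholders.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.*;

public class MskDiskAlarm {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            // Alarm when broker 1's data-log disk crosses 60% (mirroring MSK's 60/80/full alerts).
            cw.putMetricAlarm(PutMetricAlarmRequest.builder()
                .alarmName("msk-broker1-disk-60pct")
                .namespace("AWS/Kafka")
                .metricName("KafkaDataLogsDiskUsed")      // percent used on the broker's data volume
                .dimensions(
                    Dimension.builder().name("Cluster Name").value("my-cluster").build(),  // placeholder
                    Dimension.builder().name("Broker ID").value("1").build())
                .statistic(Statistic.MAXIMUM)
                .period(300)                              // evaluate over 5-minute windows
                .evaluationPeriods(1)
                .threshold(60.0)
                .comparisonOperator(ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD)
                .alarmActions("arn:aws:sns:us-east-1:123456789012:ops-alerts")             // placeholder
                .build());
        }
    }
}
```

In practice you would create one alarm per broker (or use metric math across brokers) and repeat at the 80 percent threshold.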
One thing we noticed is that it is easy to scale up capacity, but there is no capacity scale-down option. This is something that trips up customers, and customers often delay making a decision about scaling up. It's a tension that you need to wrestle with as a customer. Recognizing this challenge, one of the things we did when we built Express brokers is eliminate storage management. We started from the premise of having no storage configuration at all: you just tell us your retention period, and we are off to the races. You get virtually unlimited storage capacity per broker, which means you really don't have this conceptual notion of what disk full looks like.
Your storage is provisioned per cluster versus per broker, and that obviously has cost savings. Equally importantly, it makes it easier to imagine what your overall workload size is. You pay for what you use versus needing provisioned capacities. Last but not least, you get hands-free and instant scaling on the storage side, and that is also very useful. If you really look at the core tenets of how we worked on storage management in the Express context, our focus was to eliminate storage management for you, and that entailed doing all these five things.
Failure Handling in Kafka: Planning for Broker Failures and Recovery Time
The other area where we see customers challenged is failure handling. What does that mean? Well, here we have a three-broker cluster. You have producers, you have consumers, and you have consumers in a consumer group. The producer is producing to two brokers here. So what does the basic traffic look like? You have producer traffic, you have consumer traffic, you have replication traffic, and, depending on your workload, partition compute needs. Not so rarely, we do encounter workloads that are partition bound, and you have to account for that as well.
This is the basic configuration that you as customers need to account for: what's your producer traffic, what's your consumer traffic, what's your replication traffic, and what are your partition compute needs? And then failures happen. Let's say a broker dies. What happens in that case? You need buffers for handling broker failures. Your producers fail over, your replication switches, your leadership changes, and then the broker comes back. In the meantime, the surviving broker has to absorb the producer throughput that was previously going to the failed one. When the broker comes back, there are more workloads to account for: what does your catch-up replication look like? And eventually, as the broker rebalances and the producers switch back, you also need to account for what happens when there's a consumer rebalancing. There is some interesting new innovation happening in the Kafka space around this, but all of these are things you have to build in when planning your workloads.
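To make the buffer math concrete, here is a back-of-the-envelope sketch with hypothetical numbers (300 MB/s of aggregate produce traffic, six brokers, replication factor three) showing how per-broker load shifts when one broker fails.

```java
public class FailureHeadroom {
    public static void main(String[] args) {
        // Hypothetical workload: 300 MB/s aggregate produce across 6 brokers, replication factor 3.
        double produceMBps = 300.0;
        int brokers = 6;
        int rf = 3;

        // Steady state per broker: leader share of produce traffic plus follower replication
        // for the partitions it replicates.
        double leaderIn   = produceMBps / brokers;             // 50 MB/s
        double followerIn = leaderIn * (rf - 1);               // 100 MB/s
        System.out.printf("Steady-state ingress per broker: ~%.0f MB/s%n", leaderIn + followerIn);

        // One broker fails: its leaders move to the surviving brokers, so each survivor's
        // leader share grows by roughly brokers / (brokers - 1).
        double leaderInDegraded = produceMBps / (brokers - 1); // 60 MB/s
        System.out.printf("Leader ingress per survivor after a failure: ~%.0f MB/s%n", leaderInDegraded);

        // When the failed broker returns, it must also replay everything it missed
        // (catch-up replication), which competes with live traffic for the same disk
        // and network. Size broker headroom for this degraded-plus-catch-up state,
        // not for the steady state.
    }
}
```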
Too often we encounter customers who have workloads that work fine when things are going well, but when failures happen, that's when they encounter problems. These are some things that we ask customers to consider.
We ask customers to think through what their failure handling actually looks like. Failure recovery is also time-consuming. It's not just about having the space and buffers to account for these failures. It's also about accounting for time. In the standard Kafka case, you're scaling compute and storage together, and especially after a node recovers from failure, it has to catch up. This process is time-consuming and non-deterministic. What does that mean? It's dependent on your storage throughput, storage capacity, and compute to local storage network throughput. You have all these different factors competing for attention in that infrastructure, which means your recovery time is dependent on all of them.
Kafka is a remarkably stable system until failures happen. At that point, what customers can do is wait for recovery. That's often where many of our customer conversations happen: around how they've thought about failure recovery in these modes. On the Express side, we rethought how failure recovery happens. Because we separated our resources, your recovery is up to 90% faster. That fundamentally means you're getting higher resilience because your mean time to recovery is low. That is a phenomenally important consideration when designing resilient distributed systems: how do you keep your recovery time low so you can build resilience into your systems. That is how Express brokers are different.
Horizontal Scaling and Intelligent Rebalancing: Achieving 180x Faster Performance
What does that mean for horizontal scaling? Everyone has a plan until they have to rebalance partitions, right? That is what Kafka horizontal scaling looks like. You add three brokers and you're going to move partitions across. You have to reserve enough bandwidth for rebalancing. You have to carefully orchestrate the movement so that it does not impact your existing workloads. And last but not least, you're continuously monitoring your infrastructure so that if there's a need to change the plan, you can. Rebalancing can take hours and is again variable. That is how the system behaves.
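For context on why this is painful, this is roughly what kicking off one manual partition move looks like with the Kafka AdminClient; a minimal sketch that assumes a hypothetical topic named orders and newly added brokers with IDs 4, 5, and 6.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

public class ManualReassignment {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");          // placeholder endpoint

        try (Admin admin = Admin.create(props)) {
            // Move orders-0 onto the newly added brokers 4, 5, 6.
            // In practice you generate a full plan for every partition you want to move,
            // set replication throttles, and monitor progress until it completes.
            Map<TopicPartition, Optional<NewPartitionReassignment>> plan = Map.of(
                new TopicPartition("orders", 0),
                Optional.of(new NewPartitionReassignment(List.of(4, 5, 6))));

            admin.alterPartitionReassignments(plan).all().get();

            // Poll until no reassignment is still in flight.
            while (!admin.listPartitionReassignments().reassignments().get().isEmpty()) {
                Thread.sleep(10_000);
            }
        }
    }
}
```

Multiply this by thousands of partitions, plus throttle tuning and monitoring, and the orchestration burden described above becomes clear.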
On Express brokers, because we have separation of resources, what that fundamentally means is you can get up to 20x more elasticity compared to standard brokers. I often get this question from customers about what elasticity means. Let's look at an example. How fast can I rebalance my partitions after adding brokers while traffic is still climbing? Here's an experiment that we ran. We had three brokers, one topic, 2,000 partitions, four terabytes of data per broker. We had three brokers and added three brokers, rebalanced 1,000 partitions, all while producing 90 megabytes per second, which is pretty typical for what we see. We compared standard to Express and I'm going to show you a couple of CloudWatch graphs.
Here's what standard looks like. You have three brokers at the top. The left red line is where we issued the command to add three brokers. You can instantly see the throughput drops and continues dropping. Eventually the new three brokers start to pick up. You can see the throughput gets throttled further and eventually stabilizes as if it's six brokers. You get producer degradation until the throughput recovers, and that entire process takes about 175 minutes. From issuing the command to add three brokers, through partition rebalancing, it takes about 175 minutes until you have six productive brokers.
Here's what this looks like for Express brokers. The left line is where we added the three brokers. Within about 10 minutes, the three new brokers had joined the cluster and were taking traffic at the new throughput. From a customer standpoint, and for you as users of this, that means you're spending less time doing partition rebalancing, which gives you a more resilient system.
This also means you can scale up and scale down faster. You can run this experiment and see what your results look like. A few weeks ago, we announced intelligent rebalancing for Express brokers. With intelligent rebalancing, you get always-on rebalancing, auto partition placement, and fully automated horizontal scaling with zero-click automated heat management, which gives you improved price performance from a cluster standpoint.
Intelligent rebalancing delivers up to 180x faster rebalancing, which means shorter recovery windows and higher resilience compared to standard brokers. One of the interesting things the team implemented is built-in operational awareness. We detect when healing or patching is going on and prevent unsafe overlap. Additionally, we apply all our best practices for scale-in and scale-out intelligence before resizing clusters, which gives you a signal of what workloads your brokers can accept. This feature is now available for Express brokers and has been in customer hands for about two weeks.
Rethinking Resilience with Express Brokers and Essential Kafka Monitoring Practices
We had to rethink what resilience looks like. Here's what standard Kafka behavior looks like: you have a broker with partitions, memory, compute, network, and storage. Kafka classically is an allocate-when-asked system. You can request capacity and it will give you capacity; you can create a partition and it will create a partition. The onus is on Kafka clients and users to not overload the system. The challenge is that you only find out you have overloaded the system by monitoring it well, having heuristics, and having experience managing your workloads. When overloading happens, it causes broker failures, and you enter recovery mode.
With Express, we have the same broker model, but we have added safe dynamic throttles on compute, memory, and networking. This allows us to minimize the impact on the broker and protect it from failures. The team also thought through what it means to survive storage failures. While you cannot survive storage failures indefinitely, it builds in resilience and gives you the capacity to survive without storage for a while. As we throttle, your client can react to the throttling behavior it sees instead of just seeing the failure and reacting to it. This means you can design your system to take cognizance of the system being overloaded and account for it differently. That is fundamentally what we had to re-architect for resilience.
We made it so you do not need a maintenance window for patching. We also built in partition-level fairness, which means you get protection from noisy workloads. A single partition cannot take up all the throughput of your broker, and you get better protection. The team also thought through what it means to stop potential cascading faults and how to isolate them by design. Nothing is perfect, but there is a lot of thinking that went into building systems that can survive. Walking you through this helps you understand what it means to actually operate and run these systems at scale.
Some customers come to us and tell us they want even lower management. So we built all of this infrastructure and knowledge into Amazon MSK Serverless.
With MSK Serverless, you operate at the level of scaling up and scaling down with zero Kafka management. We normally recommend it for customers new to Kafka or new to streaming. I debated putting this next slide up, because it might seem remedial. Yet it is one of the more common failure modes we see with customers: not monitoring Kafka properly. What I will show you will seem obvious, but it is remarkable how many customers are not set up to actually monitor these systems.
These are the things we recommend you monitor in your Kafka setup. I will start with the per-broker and per-cluster items. Monitor partitions per broker, total partitions per cluster, and connections per broker. These are critical to track, and you should set alerting so that you can react to any issues. On the usage side, think about your broker CPU usage, disk usage, memory usage, and throughput usage. These simple metrics will give you a view of what it means for the service and system to operate. It is amazing how important it is to balance not just the operational side of running the system, but also the monitoring and making sure you are doing the right thing. That is what it means to run Kafka at scale.
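As a starting point for the partition-count side, a sketch like the following uses the Kafka AdminClient to count partitions and replicas per broker (it assumes a recent Kafka client library and a placeholder bootstrap endpoint); CPU, disk, memory, and throughput usage come from your metrics system, such as CloudWatch for Amazon MSK.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class PartitionsPerBroker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");          // placeholder endpoint

        try (Admin admin = Admin.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions =
                    admin.describeTopics(topics).allTopicNames().get();

            Map<Integer, Integer> replicasPerBroker = new HashMap<>();
            int totalPartitions = 0;
            for (TopicDescription td : descriptions.values()) {
                totalPartitions += td.partitions().size();
                // Count every replica a broker hosts, since followers consume resources too.
                td.partitions().forEach(p ->
                        p.replicas().forEach(node ->
                                replicasPerBroker.merge(node.id(), 1, Integer::sum)));
            }

            System.out.println("Total partitions in cluster: " + totalPartitions);
            replicasPerBroker.forEach((broker, count) ->
                    System.out.println("Broker " + broker + " hosts " + count + " partition replicas"));
        }
    }
}
```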
Running Apache Flink at Scale: Real-Time Processing, Statefulness, and Architecture
Up next, Sai will tell you about what it means to run Flink at scale. Well, thank you so much, Ashish. Now that you have all the data streamed into Apache Kafka, you need to process it, and that is where running Apache Flink at scale comes into play. How many of you have some of these use cases: anomaly detection, transforming data before loading it into an OLAP system, or building event-driven applications? If you have any of those use cases, or all three, then you should be using Apache Flink.
There are four reasons why. First, it is real-time. Second, it is great at handling dynamic datasets. Number three, it is stateful, and finally, it is programmable. Let us dive deeper into each one of these using an example. Imagine you are building a fleet monitoring system, and your goal is to detect anomalies as quickly as you can so you can improve the customer experience. You start with a few nodes, and within no time, your fleet grows. Now you have tens and hundreds of nodes, and in no time, you have thousands of nodes to monitor.
What is equally important to remember is that as your customers create new resources, you have new nodes to monitor, and when they delete resources, you no longer have to monitor those nodes. This set of data that you are monitoring changes rapidly. Somewhere along the line, you realize that your anomaly detection can be better if you also include storage monitoring into the mix. And then eventually software versions and regions. Very quickly, what you see is a big increase in the complexity of your data. This is where Flink's continuous processing is much better than using batch-based processing.
A batch is like capturing a photo. A photo captures what is true in the moment. You cannot track what happened before or potentially what has happened after. Comparatively, a video captures every frame and every detail. It gives you all the rich context available. You can pause it, go back to a point in time, compare different things, and get to the insight in a more accurate fashion. We also talked about one of Flink's strengths being that it is stateful. In our example, you are able to capture the state per key, per server, per software type. As new events come in, that state is incrementally updated by Flink.
What it does is wait for patterns to emerge and either decide that this is noise and suppress it, or it is a valid signal and act on it. In our example, it could be a breach in threshold where you say you want to monitor if you have more than five faults in any given minute, and if it breaches that threshold, you want to trigger an automated action, which is replacing that faulty node before the customer is actually seeing impact.
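A minimal sketch of that threshold rule in the Flink DataStream API, assuming a hypothetical FaultEvent type with a node ID and event timestamp; the source, sink, and the five-faults-per-minute threshold are illustrative stand-ins.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class FaultThresholdJob {

    /** Hypothetical event shape: which node reported a fault, and when. */
    public static class FaultEvent {
        public String nodeId;
        public long timestampMillis;
        public FaultEvent() {}
        public FaultEvent(String nodeId, long timestampMillis) {
            this.nodeId = nodeId;
            this.timestampMillis = timestampMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; in practice this would be a Kafka source.
        DataStream<FaultEvent> faults = env
                .fromElements(new FaultEvent("node-1", System.currentTimeMillis()))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<FaultEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((e, ts) -> e.timestampMillis));

        faults
                .keyBy(e -> e.nodeId)                                      // state is kept per node
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))      // one-minute buckets
                .process(new ProcessWindowFunction<FaultEvent, String, String, TimeWindow>() {
                    @Override
                    public void process(String nodeId, Context ctx,
                                        Iterable<FaultEvent> events, Collector<String> out) {
                        long count = 0;
                        for (FaultEvent ignored : events) count++;
                        if (count > 5) {                                   // breach: more than five faults per minute
                            out.collect("Replace faulty node " + nodeId + " (faults=" + count + ")");
                        }
                    }
                })
                .print();                                                  // stand-in for the remediation action

        env.execute("fault-threshold");
    }
}
```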
Being stateful gives you that rich memory to make more context-aware decisions. The last piece is that it's programmable. In our fleet monitoring example, you can start with simple if-then-else logic. If this happens, you're going to do this. But over time you realize you also want to monitor your storage, so your logic can evolve to now include other metrics. You can also map the data to a database table so that you can actually enrich that context. Finally, you can start with simple actions. In the fleet monitoring example, that customer started with creating an alarm so that a human can react to that alarm and replace that node. But over time they grew confidence in operating that system and started taking automated action using that workflow. This gives you the ability to constantly evolve your application. The combination of real-time processing, statefulness, the ability to handle dynamic datasets, and programmability is also the reason why customers are turning to Flink to build agentic AI applications.
So what does it take to do more data processing workloads in real time? When we talk to a wide variety of customers, three particular patterns emerge. First, it requires a change in your operating model. Second, you need resilient Flink infrastructure. Finally, you need to build programming logic that is robust. It sounds obvious, but it's very important. Let's start with what we mean by changing the operating model. A good way to understand that is to compare it with batch-based processing, something that has been popular over the years. In the batch world, your data is largely fixed. You're processing data that is either days old or a few hours old, and the analyst is running different sets of queries to find that insight. Then you're essentially leaning on a human to generally take an action. Compare that with a streaming system. Data is getting continuously generated, and then on this set of continuously generating data, you're applying a set of generally static rules to understand insights and almost automate the action. In our fleet example, the automation is replacing that faulty node or storage device.
Time is also handled differently. When you take a photo, the snapshot itself tells you it was taken at a particular moment. But when it comes to video, you have to specify that time explicitly: you go to a particular timeframe and then compare things. So time is explicit in streaming systems. That is what we mean by changing the operating model: you're moving away from dynamic queries on static data to largely static queries on dynamic data.
That also means that, as developers, there are some new concepts for you to learn. First is event time, which tells you when the actual event happened, not when it was processed. For example, in a fleet monitoring system, you could get events related to storage before you get events related to the server. If you end up using processing time, you could process the events related to your server incorrectly, and that means your action would not be accurate. So leaning on event time becomes very important. The second key concept is windows. Because you have a continuous flow of data, windows allow you to group events into small buckets for analysis. The next important concept is partitioning, which is how data is distributed so that you can parallelize your processing. Because Flink is trying to process data at really high throughput and low latency, partitioning becomes very important. Lastly, one of Flink's strong calling cards is its support for exactly-once processing, which means that as a developer you are not worried about duplicates: you are processing each event exactly once.
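As one concrete illustration, this is roughly what an exactly-once Kafka sink looks like with the flink-connector-kafka DataStream API; the bootstrap endpoint and topic name are placeholders, and the job must have checkpointing enabled for the transactional commits to work.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceSink {
    public static KafkaSink<String> build() {
        // The sink participates in Kafka transactions that commit on each completed checkpoint,
        // giving end-to-end exactly-once delivery into the topic.
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker1:9092")                    // placeholder endpoint
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("fleet-anomalies")                    // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("fleet-anomalies-sink")       // should be unique per job
                .build();
    }
}
```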
So how does Apache Flink support high-throughput data at low latency with remarkable consistency? The answer lies in its architecture. The Job Manager is like the brain of the Flink system: it plans, schedules, and coordinates the work. Then you have the Task Managers, where the actual processing happens, and each Task Manager is further broken down into task slots, which enable parallel processing. Here's how it works. You deploy your logic as a JAR, and the Job Manager takes the code, first translates it into a logical unit of work, then breaks that into physical units of work and places that work into task slots. Why? Because it wants to guarantee parallelism. Flink also captures state, and we've talked about how important that is.
It periodically checkpoints its processing state into a durable store such as Amazon S3. The reason it does this is to give you the guarantee of exactly-once processing, but also to provide resilience when failures happen. When a job fails or infrastructure fails, Flink needs to know where to start and how to get those operators processing data again. This combination of orchestration by the Job Manager, parallelism driven by the Task Managers, and checkpoints gives you the right set of parameters to run at scale.
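For a self-managed setup, the checkpoint knobs involved look roughly like the sketch below (the managed service configures these for you); the intervals, timeout, and S3 path are illustrative assumptions.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void configure(StreamExecutionEnvironment env) {
        // Checkpoint every 60 seconds with exactly-once semantics.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig cfg = env.getCheckpointConfig();
        cfg.setCheckpointStorage("s3://my-bucket/flink/checkpoints");   // placeholder durable store
        cfg.setMinPauseBetweenCheckpoints(30_000);                      // let processing breathe between checkpoints
        cfg.setCheckpointTimeout(10 * 60_000);                          // fail slow checkpoints instead of hanging
        cfg.setTolerableCheckpointFailureNumber(3);                     // restart only after repeated failures
        // Keep the latest checkpoint if the job is cancelled, so it can be restored later.
        cfg.setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}
```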
Operating Flink Applications: Deployment Lifecycle, Blue-Green Deployments, and Code Optimization
As an administrator, what are all the activities you're responsible for to run these systems at scale? They largely fall into three categories: first is deployment, second is monitoring, and the last is scaling and evolution. Your developer comes to you wanting to build an application. The starting point is picking an infrastructure provider, such as a Kubernetes-based system. Then you have a choice to make: whether you want to run that in application mode where you have a dedicated set of infrastructure for that job, or you want to run in session mode where you can run multiple jobs on that same set of infrastructure. For most at-scale applications, you choose application mode.
Then you move on to monitoring once you've deployed the infrastructure. We'll dive deeper into this subject, but the core goal there is to make sure that you're always processing data. Finally, scaling the system is super important. We've talked about how programmability and evolution are key parameters for Flink. Your goal there is to monitor when you need to scale, do the scaling event, and also support deploying new code. That's what the lifecycle of a Flink job looks like and the activities around it.
Now, for developers running applications, this is a fairly involved process. At AWS, we offer a managed service for Apache Flink that provides the easiest way for customers to build resilient Flink applications. The developer experience with our managed service is far simpler compared to the deployment example I gave you before. There's no infrastructure setup. You get built-in multi-AZ resiliency and there is no configuration tuning.
Here is how it works. You go to your favorite IDE, build the application, get a JAR file out of it, create an application within the managed service, and then you're off to the races to start processing data. There's no compute to manage and there are no configurations for you to set up. Now that you've got this thing up and running, you have to operate it. What I'm going to do next is walk you through our learnings operating Flink applications at scale and how you can potentially benefit from it. I'm going to focus on two specific learnings.
The first learning we've had is that in Flink, a significant change can lead to processing being interrupted. What is a significant change? It could be that you're trying to scale the system. It could be that you are trying to handle a node failure, or it could be that you're doing patching and software upgrades. In each of these cases, what happens is Flink detects a change in the system, pauses everything, reassigns things, and then resumes processing. Our first goal in operating the system is making sure that we are minimizing these processing delays.
How do we do that? The key is to separate the time it takes to spin up EC2 instances and other prerequisites from the overall job downtime. We do that with two specific techniques: one is blue-green deployments, and the other is having a warm pool. With blue-green deployments, we do not stop the job until we set up and verify the prerequisites. Here is how it works: first, we spin up the necessary EC2 instances. Next, we verify your job configuration and make sure all the prerequisites are met. At that point is when we switch over from the old infrastructure to the new infrastructure and resume processing from the exact same point at which you stopped processing previously.
The next key concept is warm pools. Instead of waiting for EC2 instances to spin up and the amount of time it takes, we maintain a warm pool of infrastructure, thereby getting an instance that is ready to go. The combination of blue-green deployments and warm pools is how we reduce the time and how we reduce job downtime when there's a change in the system.
Another key aspect that we've talked about is how Flink allows you to change and evolve your logic. Our fundamental goal is to make this whole process simple for our developers. We want it to be repeatable, fail-proof, and quick so that they can consistently make changes.
A developer decides they want to make changes, builds a new JAR file with their code changes, and submits it to the managed service. When we detect that a new deployment is happening, we do not stop the job right away. Instead, we take a snapshot of the current running job and durably store it in S3. Then the managed service sets up the new job with the configuration and sets up the new infrastructure, including EC2 instances, making sure they are spun up and running.
We set up the job configuration, validate everything, and then restore the state from what we have stored in S3. Now your operators are running with the preserved state, and that is when you switch over to the new infrastructure and resume processing. We pair this with smart guardrails, which detect known code issues so they do not impact job processing. For example, imagine you upload new code that has a null pointer exception. The guardrails detect that the code is having issues, automatically roll back to the last known good state of the application, and then resume processing.
A combination of automation and the orchestration of how we deploy new code, through blue-green deployments and smart guardrails, is how we give developers the confidence to constantly evolve their applications in a highly resilient way. The first learning we discussed is that any change in Flink means there is job downtime, and we need to minimize it. The second key learning we have had is that developers and administrators in Flink often work with different abstractions. Developers think about core logical operators, while administrators think about resources and their utilization, and they have no visibility into the code.
Similarly, developers have no visibility into the actual infrastructure. We recognize there is a need for them to have a common mental model for how to reason about Flink. There are two key shared mental models that we recommend to customers. The first is fixed units. Typically, it is very hard and very unpredictable to map Flink code to actual resources like task managers and task slots. The reason is that different task managers can have different performance characteristics, and depending on how you have created slots, they can also have different performance characteristics. This results in a pretty unpredictable system.
The goal with fixed units is to bring predictability into the system. In our managed service, the smallest unit of provisioning is a KPU, which gives you one vCPU, four gigabytes of memory, and fifty gigabytes of storage. This means you get deterministic performance per parallelism. Each lane is also isolated, so the trouble with noisy neighbors is no longer an issue. It also makes debugging easier. When you have hot slots or issues, it is very easy to detect exactly where it is happening as an administrator and resolve that very quickly.
Lastly, it simplifies scaling. If you want to add more capacity, you simply add more parallelism units. You also get predictable state slices, so things like checkpointing and backpressure management become much more consistent for application administrators. For developers, fixed units mean your workflows are simple. You can design your code so that it fits in a fixed budget, and once you figure out what that deployment profile looks like, you can scale it by simply adding more KPUs. For administrators, by enforcing these guardrails, you are being more proactive rather than reactive when it comes to handling issues.
The second common way you want people to reason about Flink is to think about job availability rather than infrastructure availability. Job availability tracking in Flink is very hard. Flink reports whether the job is running, but that does not always mean that data is getting processed. There are three reasons why that may happen. First, the boundaries between user code, the Flink runtime, and external systems are not always clear, so debugging is very complex. Second, restarts and recovery mean that the underlying root cause is often masked. The second run, which starts producing different errors, will mask the errors that were actually caused by the original problem, making troubleshooting very hard.
Finally, issues can happen that are external to the Flink runtime, such as node failures or storage failures. It is important to track them and address them.
We recognized that we need a smarter detection system—a system that can detect issues, classify them, and be very surgical about the action it takes. We built a system that consistently monitors a Flink job and classifies whether an issue is related to user code or is system-driven. A user-driven issue could include problems with your code, such as null pointer exceptions, or resource exhaustion created from the code base. A good example is opening too many connections without closing them, or picking incorrect configurations for your job. All of these fall into the category of user code issues. System-related issues include node failures, Flink bugs, or generally issues with connectors. The automatic healer identifies error causes and classifies them. The reason classification is important is that it allows us to be very surgical about the fix we want to apply. For example, if we detect incorrect configurations, we can dynamically change those configurations, making restart much faster. The whole purpose of the system is to reduce the mean time to recover from failures.
Let's dive deeper using a few examples. The first one is performance degradation. Imagine you've built an application for the fleet monitoring system that works great at 100,000 records per second, but now you want to scale to 500,000 records per second. Those specific static configurations you picked no longer work for this increased scale. Flink gives you 400 plus configurations to choose from, which makes the whole process harder. Our detection system detects that there's an issue with incorrect configuration, fixes it dynamically, and thereby recovers the job very quickly from the performance degradation. The second example is something we see very typically with connectors. A good scenario is when Flink is trying to restart the job to recover from another failure, and it realizes there's a connector process that is not responsive. The detector identifies this issue with the connector process, kills that process externally, and thereby the job can restart very quickly. Here's an example where the goal is to recover faster from an issue.
Finally, failures can happen outside the Flink job. A good example is node failures. Failures do happen and infrastructure does fail. When it fails, we detect the issue, pick a node from the warm pool and then accelerate the recovery process very quickly for customers. These three examples demonstrate how having a detection system, classifying issues, and being very surgical with what we fix reduces the mean time to recover from failures. We've also recognized that availability is a function of your code. Issues such as high CPU, out of memory, and disk throttling can come from how the code is structured. Let's understand this using an example. Imagine a developer says they have a high CPU issue. You start diving into that job graph and realize the job has morphed into a wide variety of subtasks—1,600 plus subtasks—and they're trying to fit that into 10 KPUs. One way to solve the problem is to horizontally scale and add more KPUs, which is almost like throwing money at the problem. The other way is to look at whether the code is structured properly.
One way to think about this is to ask whether there's excessive branching happening in the code. Going back to our fleet monitoring example, one way to branch is by alarm type: you're executing the same logic for each alarm type, so if you have 20 alarm types, you have 20 branches. Another way to approach it is to ask whether there's something common across those 20 branches. You could potentially organize them by window size, say one-minute, three-minute, and five-minute windows, and go from 20 branches to three. Instead of throwing infrastructure at the problem, you reorganize the code so you have fewer subtasks and can fit into the 10 KPUs we discussed. Another example is parallelism. Sometimes you see that not all task managers are busy: some are running extremely hot while others are idle. This means you've picked an operator parallelism that doesn't work well consistently across the entire pipeline. This is a case where you could work with the developer to ensure they're very specific about the operator parallelism they're picking across different parts of the job; a short sketch of pinning operator parallelism per stage follows.
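Here is that per-stage parallelism idea as a minimal sketch; the stage names, stand-in source and sink, and parallelism numbers are hypothetical.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorParallelism {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(8);                              // default for the whole job

        DataStream<String> events = env
                .fromElements("fault:node-1", "fault:node-2");   // stand-in source

        events
                .map(String::toUpperCase)
                .name("parse-and-enrich")
                .setParallelism(8)                          // CPU-heavy stage: keep the full parallelism
                .filter(e -> e.startsWith("FAULT"))
                .name("filter-faults")
                .setParallelism(2)                          // cheap stage: fewer subtasks, fewer idle slots
                .print()
                .setParallelism(1);                         // single writer to the stand-in sink

        env.execute("operator-parallelism");
    }
}
```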
These are a couple of examples where how your code is laid out can have an interesting side effect on the infrastructure. So what it comes down to is that availability is also a function of your code.
Monitoring Best Practices, Development Guidelines, and Key Takeaways for Running at Scale
What it comes down to is you need another layer of defense to make sure you're detecting these issues very quickly. Now, Flink gives you a wealth of metrics, and I'm not going to spend time walking through every metric, but what I wanted to share with you is a mental model around what our customers use when they operate Flink at high scale. The first category is application health. The goal here is not just the actual uptime, but also more importantly, is the job making progress? Because at the end of the day, it's all about processing data.
The second category is checkpoints. Checkpoints are Flink's safety net to recover from failure. You want to have a robust monitoring mechanism for these checkpointing data. You're looking at things like whether the duration of checkpointing is long and whether the sizes are pretty high. You want to make sure that system is very robust. The third important monitoring area is latency. At the end of the day, Flink is a real-time system, and you want to make sure you're always processing the freshest data. One way to do that is to compare your watermarks and see if they are slow or fast.
A slow watermark means lag is building up, which means that eventually you have stalled task managers. The goal there is to compare, for example, your watermarks with your processing time to see how far behind the latest data you are. The fourth dimension is throughput. Not just the aggregate throughput, which is easier to monitor, but throughput at finer granularity, to make sure backpressure is not building up in the system. For example, if one part of the workflow gets backed up, that backpressure spreads across the entire pipeline. The goal is to monitor throughput at the operator level and at the subtask level, so that you have insight into exactly where within the pipeline the backpressure challenges are.
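One way to get that watermark-versus-processing-time comparison is to expose the gap as a custom metric from inside the job; a minimal sketch using a Flink ProcessFunction and a gauge, where the metric name eventTimeLagMillis is an arbitrary choice.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class WatermarkLagTracker<T> extends ProcessFunction<T, T> {

    private transient volatile long eventTimeLagMillis;

    @Override
    public void open(Configuration parameters) {
        // Expose the current event-time lag so it can be scraped and alerted on
        // alongside the built-in Flink metrics.
        getRuntimeContext().getMetricGroup()
                .gauge("eventTimeLagMillis", () -> eventTimeLagMillis);
    }

    @Override
    public void processElement(T value, Context ctx, Collector<T> out) {
        long watermark = ctx.timerService().currentWatermark();
        long now = ctx.timerService().currentProcessingTime();
        if (watermark > Long.MIN_VALUE) {
            // How far event time (the watermark) trails wall-clock time right now.
            eventTimeLagMillis = now - watermark;
        }
        out.collect(value);   // pass the record through unchanged
    }
}
```

It can be applied as a pass-through stage, for example stream.process(new WatermarkLagTracker<>()), close to the operators you care about.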
Those are the four categories for building a robust monitoring mechanism for your Flink applications. You also want to understand some of the best practices we can share with development teams. One of the successful patterns we've seen is customers democratizing how Flink applications are built across their enterprise. You want to understand what good code behavior looks like, when it doesn't work, and what the anti-patterns look like on the infrastructure. The first area is timers and watermarks. We've talked about how event time is super critical, and when you don't use event time, you end up using processing time. The typical complaint you see is, "Well, my aggregations are off." That tells you that somewhere in the code people are using processing time instead of event time.
Another scenario with timers is that your watermarks are slow. If your watermarks are slow, there's too much lag. What if they're too fast? Then you're not waiting long enough for data that is trickling in slowly. You need to find the right balance between being too fast and too slow. The second area is state hygiene. State is such a crucial aspect of Flink, but you don't want to keep everything in state all the time. What you want is to enforce TTLs, evicting state as and when necessary. Getting your state management right is super critical in your code.
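A minimal sketch of enforcing a TTL on keyed state with Flink's state API; the one-hour TTL and descriptor name are arbitrary examples.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class StateTtlExample {
    public static ValueStateDescriptor<Long> faultCountDescriptor() {
        // Evict per-key state that has not been written for an hour,
        // instead of letting it accumulate forever.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.hours(1))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .cleanupFullSnapshot()                      // also drop expired entries when taking full snapshots
                .build();

        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("fault-count", Long.class);
        descriptor.enableTimeToLive(ttl);
        return descriptor;
    }
}
```

The descriptor would then be used from a keyed function's open method via getRuntimeContext().getState(...).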
The third big area is avoiding data skew. One of the calling cards of Flink is how you're able to partition the data and process each chunk independently. You do that by using keyBy operators. You want to think about what type of operators and what type of data patterns you want to use. You certainly want to avoid low-cardinality keys, because with low cardinality you end up with some task managers running hot while others sit underutilized. You want to pick a high-cardinality key each time. Also, you want to reduce the number of shuffles you do.
Shuffles are critical because you shuffle data to bring together records for a particular key and process them differently. You will end up using shuffles, but using too many means data is getting moved around too much, which also increases the time spent on serialization. That leads to the CPU being busy, garbage collection being slow, and things of that nature. Lastly, serialization and schemas play a very pivotal role. Not all use cases benefit from the default serialization. Relying on it, especially for CPU-intensive workloads doing complex transformations, again turns into an infrastructure issue. Schemas also play a very important role.
Instead of getting all the data and then validating it for quality, schemas allow you to enforce that on the producer side, so the consuming Flink code can be confident the data it receives is of sufficient quality. In summary, writing optimal Flink code will make sure your applications are fast, resilient, and reliable. So that brings us to what it takes to run Flink at scale. And if you want all these capabilities without the overhead, we offer our managed services.
I want to bring Ashish to share his key takeaways. We've talked about what it takes to run Kafka and Flink at scale. What are some of the top key takeaways from this?
Yeah, I think it's a really pertinent point. One of our motivations in doing this session was sharing our experience. A lot of the discussions, which I know Sai has as well as I do, are about customers who know Kafka but find that the effects of scale are different when they encounter running at scale or when they encounter failures at scale. One of our learnings is that oftentimes the only place you actually learn about failures and how to handle them is when failures happen, and that's probably the worst time to learn those lessons.
What we've tried to do is explain some of our thinking in terms of best practices, but also explain how we are building the system from the ground up to make sure that these things are accounted for. You saw this on the Kafka end as we transition from standard to serverless to Express, and you're seeing this on the Flink end on how we are applying that. Part of why we are sharing this is so that as you are making your choices about running your infrastructure, you are able to make a decision on which of these to apply, how you'll apply it, and how you'll run it.
The last big takeaway from me, and definitely looking to learn from you on the Flink side as well, is that too often customers are running infrastructure but not monitoring the things that can cause failures. What happens is that when failures occur, or the SOPs are not functional, you discover that your system is not running as stably as you thought. Part of this is that there is a shared responsibility boundary: managing the infrastructure, which we will do and we will own, and, equally importantly, managing the operational side of Kafka or Flink, which you as customers have to participate and partner with us on. That, Sai, is my key takeaway. Partly what I'm hoping here is that we can help customers run better systems as they're thinking about their choices.
It's very similar to what we've spoken about on Kafka. I think the three things are knowing your shared boundaries. With Flink, the shared boundary also means you're running code, arbitrary code, so that has interesting effects that we talked about. Second is picking the right offering. Sometimes you can learn from our best practices and learnings that we built into our products and build infrastructure on your own or use some of our services. I think it's very similar to our learnings on the Kafka side as well.
Well, that brings us to the end of the presentation. Now you can try it for yourself, both MSK Express brokers and Managed Service for Apache Flink for your Flink applications, because at the end of the day, seeing and trying is believing. With that, thank you so much.
This article is entirely auto-generated using Amazon Bedrock.