Kazuya

Posted on Dec 5, 2025 • Edited on Dec 8, 2025

AWS re:Invent 2025 - Amazon Kinesis Data Streams under the hood (ANT423)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Amazon Kinesis Data Streams under the hood (ANT423)

In this video, John Morkel and Ali Alemi provide a deep dive into Amazon Kinesis Data Streams architecture, revealing how it processes 807 million records per second during Prime Day and handles 70 petabytes daily across 3.4 million streams. They explain the newly launched Kinesis On-Demand Advantage with warm throughput capabilities and 40% cost reduction, plus large record support up to 10MB using token bucket throttling. The presentation includes a live social media sentiment analysis demo showing how warm throughput prevents data loss during viral traffic spikes. Ali details the internal architecture including the front-end API layer, cache service, membership service, timer service, and shard splitting mechanisms that enable horizontal scaling without data reordering. They cover best practices for producers using PutRecords API with record aggregation and consumers leveraging Kinesis Client Library version 3 for automatic load balancing across processing threads.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction to ANT423: Amazon Kinesis Data Streams Under the Hood

We're about to find out all about Amazon Kinesis Data Streams under the hood. This is the 400 level talk, ANT423. Thanks for joining us. My name is John Morkel. I'm a Director of Software Development, and when I'm not speaking to nice people like yourselves, I lead the team that owns Amazon Kinesis Data Streams. Joining me today is my colleague, Ali Alemi. Ali is a Principal Data Streaming Architect, and he's going to be taking you through some of the finer details of Amazon Kinesis Data Streams. We're looking forward to getting into it.

Just a very quick overview of what we'll cover today. I want to give you a quick show of hands. I know this is maybe not everyone's favorite part of the talk, but who here has used Amazon Kinesis Data Streams? Great, a lot of you. Obviously, you're going to be quite familiar with some of the terms, but just to level set, I'm going to introduce some of these concepts so that we're all speaking the same language. We're quite excited about some of the new features that we've launched in the last month that have answered a bunch of requests that customers have been making of us.

I'm going to quickly go over the highlights there. Then I'm going to dig into a demo where we actually demonstrate a real-world streaming application. This is a live demo, or rather, it's a demo that you can download yourselves and run for yourselves. It was put together by Ali. I'll talk through a real-world streaming application that demonstrates some of this new functionality that we've just recently launched. Then I'll hand over to Ali, who's going to take us through the really interesting parts of Amazon Kinesis Data Streams under the hood.

What is Amazon Kinesis Data Streams and Its Massive Scale

What is Amazon Kinesis Data Streams? Many of you already know. It's a fully managed, serverless data streaming service. With any streaming application, you collect, process, and analyze data in a real-time setting. Amazon Kinesis Data Streams sets itself apart because it is serverless, it can scale, it's elastic, it is highly secure, and it's highly available.

The average streaming application looks something like this. On the left-hand side of this diagram, you've got your input for your streaming data. This could be clickstreams, it could be IoT devices, it could be mobile apps, or logs coming from your application servers. All sorts of different sources of data can come from your production network, from the internet, or from wherever. All of this data gets aggregated into a single point. These sources of data we call producers. The producers on the left-hand side send their data to Amazon Kinesis Data Streams here, which is your aggregation point.

Then it is consumed by a suite of different options. We've got services such as Managed Service for Apache Flink, Spark on EMR, Firehose, Amazon EC2, and Lambda if you want to write some custom code to process. These consumer applications are the heart of the streaming application architecture. Once you process this data, you will store that in some sort of repository, either a data lake or some sort of analytics database so you can use it and derive business value from it.

Another show of hands. Who bought something on Amazon on Prime Day this year? If you bought something, you were one of the people who pushed Amazon Kinesis Data Streams to a peak of 807 million records per second during Prime Day this year. To give you an idea of the scale of that, the global credit card transactions that are processed in a day could be done in about 3 seconds at this rate. That's a pretty phenomenal amount of information being processed through Amazon Kinesis Data Streams during that event.

Every day, over the 3.4 million or so streams that customers have on Amazon Kinesis Data Streams, we process about 70 petabytes of data. Zooming in a little bit onto the scale of an individual stream, customers are sending as much as 10 gigabytes of data per second through one single stream.
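To put these figures in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above and assumes decimal units (1 PB = 10^15 bytes), which the talk does not specify; the averages it prints are derived illustrations, not figures from the presentation.

```python
# Back-of-the-envelope checks on the scale figures quoted in the talk.
PEAK_RECORDS_PER_SEC = 807_000_000   # Prime Day peak
DAILY_BYTES = 70 * 10**15            # ~70 petabytes per day (decimal PB assumed)
STREAM_COUNT = 3_400_000             # ~3.4 million streams
SECONDS_PER_DAY = 24 * 60 * 60

# Records handled in three seconds at the peak rate.
records_in_three_seconds = PEAK_RECORDS_PER_SEC * 3

# Fleet-wide and per-stream averages implied by the daily volume.
avg_bytes_per_sec = DAILY_BYTES / SECONDS_PER_DAY
avg_bytes_per_stream_per_sec = avg_bytes_per_sec / STREAM_COUNT

print(f"records in 3 s at peak: {records_in_three_seconds / 1e9:.2f} billion")
print(f"fleet-wide average:     {avg_bytes_per_sec / 1e9:.0f} GB/s")
print(f"average per stream:     {avg_bytes_per_stream_per_sec / 1e3:.0f} KB/s")
```

Under these assumptions, three seconds at peak is roughly 2.4 billion records, and 70 PB per day averages out to about 810 GB per second across the fleet.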

That sounds like a big number, but it's gigabytes with a big B. That's about a 4K movie every second, so we're talking about some serious data. Now we're zooming in even more from this very large wide scale that Kinesis Data Streams can operate at to the scale of a single Kinesis data stream. On an individual stream, you can have gigabytes per second of data ingestion, gigabytes per second of data fan-out to your consumers, and you're processing potentially up to millions of records per second. This can all be within a multi-tenant solution, so you can have different applications producing data and consuming data all within a single stream.

Four Foundational Pillars and the Shard Architecture

Kinesis Data Streams is what we call a foundational AWS service. This foundation is underpinned by four pillars. First, Kinesis Data Streams is elastic, meaning it will automatically scale up and down as your load necessitates. It's easy to use, which is honestly one of the greatest parts of Kinesis Data Streams. You don't have to manage any infrastructure yourself, and you don't have to worry about patching or network connectivity. All of that is taken care of for you because of the ease of use of Kinesis Data Streams.

It's reliable, so it operates with very low latency and predictable latency, and you can scale that all the way out while getting the same performance envelope. It's a really solid building block for any application to be built with. The last piece, as you would expect from any AWS service, is that it's highly available. We architect our service across multiple availability zones with many redundancies built in under the covers. You can be assured that you've got the best availability you can get for a streaming service without having to think about any of these complexities.

Now we're really zooming into the fundamental unit of a Kinesis data stream, which is the shard. A shard is the atomic unit of scale with very well-defined limits. A shard can do 1 megabyte per second of writes and 2 megabytes per second of reads. We support about 1000 records per second writes and 2000 records per second reads. This is the building block from which every streaming application can scale.
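As a rough illustration of how these per-shard limits translate into capacity planning, the sketch below estimates a shard count from a write workload. The function name `shards_needed` is hypothetical, and the limits are the write figures quoted in the talk; a real sizing exercise would also look at read throughput and key distribution.

```python
import math

# Per-shard write limits as stated in the talk.
SHARD_WRITE_MB_PER_SEC = 1.0
SHARD_WRITE_RECORDS_PER_SEC = 1000


def shards_needed(write_mb_per_sec: float, write_records_per_sec: float) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(write_records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # Whichever limit is hit first determines the shard count; a stream has at least one shard.
    return max(1, by_bytes, by_records)


# A workload of many small records is bound by the record limit, not the byte limit.
print(shards_needed(write_mb_per_sec=0.5, write_records_per_sec=50_000))
```

The point of taking the maximum is that a stream of small records can run out of records-per-second capacity long before it runs out of megabytes per second.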

One of the interesting things about how a shard is used is the partition key. Your partition key is something that you specify when you write any data to Kinesis Data Streams. It is then hashed and distributed across your shards. The mapping between your partition key and your shards is automatically managed for you as your application scales up and down. This level of indirection really helps you build horizontally scalable applications using Kinesis Data Streams.
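The mapping from partition key to shard can be sketched as follows. Kinesis Data Streams documents that partition keys are hashed with MD5 into 128-bit integers; everything else here (the helper names, the evenly divided ranges, the example key) is an assumption made for illustration and not the service's internal code.

```python
import hashlib
from typing import List, Tuple

HASH_SPACE = 2**128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to an integer in the 128-bit hash space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Divide the hash space into contiguous, inclusive ranges (one per shard)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the shard whose hash range contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("ranges do not cover the hash space")


ranges = even_ranges(4)
print(shard_for_key("user-42", ranges))  # the same key always lands in the same range
```

Because the producer only supplies the key, the ranges can be re-drawn behind the scenes without the producer changing anything, which is the level of indirection described above.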

There are pros and cons of having this type of shard concept. It does enable scaling, and it's a convenient way to manage your capacity in discrete units so you can pay as you go. It gives you fundamental elasticity when thinking about how much capacity you're using. However, there are some challenges. There are mappings between your partition key and your shards, and sometimes this is not obvious at first glance. There is also complexity when things scale back in. When data gets redistributed across different shards, where does it go and how do you handle that? We'll get into some of the ways that Kinesis Data Streams helps you there. You can also have challenges where your consumers are potentially having uneven load across them. Although shards are very convenient as a building block, they are not a panacea, and there are some interesting things we've done to make them useful for you.

Evolution of Capacity Management: From Manual Provisioning to On-Demand Scaling

Some of the things we've heard our customers ask for on Kinesis Data Streams are that they don't much like managing capacity. Serverless obviously helps a ton with that in that you're provisioning things in generally abstract units of capacity.

However, on-demand scaling was something that customers were really asking us for. About four years back, we launched the on-demand version of Kinesis Data Streams where we would automatically monitor the amount of load that you were sending through to a particular stream. We would add shards and merge shards as needed. This meant that if you had a traffic spike, an unexpected surge of records came into your stream, we would automatically add new shards for you. You wouldn't have to worry about that. It would just happen transparently.

The other thing that customers are also asking us for is the ability to write larger records to Kinesis Data Streams. Up until quite recently, there was a limit of 1 megabyte for a record. Customers had to jump through some hoops in order to work around that. The real benefit of the on-demand version of Kinesis Data Streams is you had totally hands-free capacity management up to the top end of what we see with customers on streaming applications, which is about 10 gigabytes per second. This scaling happened for you automatically. You didn't have to set any thresholds or anything. It was just done as soon as you were using the on-demand version of Kinesis Data Streams.

You had the same performance envelope as you would normally have had if you were hand-provisioning or hand-managing your provisioned shards in the original Kinesis Data Streams provisioning model. With on-demand, you didn't have to worry about shard hours or unused shards. You were simply paying for the data that you ingested, which was quite attractive as a pay-as-you-use model. Let me show you here in these CloudWatch graphs what it looks like to use either provisioned or on-demand mode. You can see a workload here with about five shards provisioned, happily handling a load of about four megabytes per second. Then there's a traffic spike.

You can see the amount of throttled records, which is the upper line there in that top graph, jumps up to about 60 percent. That's not good. You're dropping data on the floor. In this case, what happened was we enabled on-demand capacity, which is something you can do for an existing stream. You can see that automatically the system detected that the load on the existing shards exceeded the provisioned throughput. It split the existing shards and provisioned new shards. The amount of throttled data on the ingress side drops down to zero. This is super simple. Within a few minutes, totally hands-off, the scaling problem was addressed for them.

Introducing Kinesis On-Demand Advantage with Warm Throughput

Some of the challenges that builders such as yourselves have had, that we've heard about building streaming applications, are that you want predictable performance at scale. You want to be able to have costs that are predictable as you scale up and down, especially with spiky workloads. You want the performance envelope to be quite consistent. These are all considerations if you're trying to build a streaming application. One of the things that we launched last month is Kinesis On-Demand Advantage. There are two key features that this new version of on-demand provides. The first is the ability to specify warm throughput. If you know ahead of time that you're going to have an increase in load, instead of having to overprovision yourself and pay for potentially idle shard hours, you can just say, look, I'm expecting this amount of data to be coming in at my peak.

I'm just going to give you a heads up that I'd like to have this amount of warm throughput available for when I need it. It's really simple to use. You simply enable it. You can either create the stream using the command line tools or in the console. You can just specify what you would like your throughput to be. Or you can even specify this on an existing stream.

The second is pricing: On-Demand Advantage is about 40% cheaper than regular on-demand pricing. The reason we've been able to achieve this price point is that we did some work to determine that with committed workloads, we could support this pricing for customers who have all the reasons to want to use on-demand with reactive scaling and the pay-as-you-use model. If you're able to commit to a certain amount of usage, we can provide it to you at a much better price point. On-Demand Advantage is really worth a look, and I would suggest you read about it in the Kinesis Data Streams documentation.

Large Record Support: Handling Up to 10 Megabytes with Token Bucket Throttling

The next thing that customers have had challenges with is large records. If you needed to ingest data larger than a megabyte, typically what you would do is store it in another data store, such as Amazon S3, and then store a pointer in your data stream to that record. Your consumer would then understand that when it sees this pointer, it has to go and fetch the record. This presented a couple of challenges. First, there's a bunch of complexity that you'd have to build into your consumers. Second, the amount of data going through a particular shard isn't actually representative of the amount of work that has to be done by a particular consumer. You can get a fair amount of imbalance between your different consumers because these larger records are opaque to the scaling system and the balancing systems, and you can end up with some pretty significant stalls in your consumer performance.

What we did was add large record support. We've increased the maximum record size by a factor of 10, so you can now send records up to 10 megabytes in a single shard. It's super easy to enable—it's just a setting on your stream—and there's no additional cost to use it. One thing that I'll explain here, which is kind of interesting, is how this works with the 1 megabyte write and 2 megabyte read limitations of an individual shard. How would you be able to write 10 megabytes of data to an individual shard with these throughput limitations? I'll get into that in a second.

The way we actually do the throttling for an individual shard is with something called a token bucket. We keep a bucket of tokens where, as you're using throughput per second, we deduct from that bucket the amount of tokens you've used and we replenish it at a certain rate. In this case, these buckets replenish at a rate of about 1 megabyte per second. When a large record comes through and you consume in one big burst, all of the available tokens are consumed. In this case, there were 1010 tokens in this bucket. That's okay because that was able to successfully go through because we had that burstability and those microburst allowances within the model that we still have for the hard provisioning on the shards. You can see there that we've exhausted our bursting capacity and we're able to continue to write at the 1 megabyte per second level.

But what would happen if another burst comes through? Because there are no available tokens in the bucket beyond the 1 megabyte per second that we already have, those would be throttled until you back off a little bit on your throughput. In this case, you can see we backed off to about 0.5 megabytes per second, and then that token bucket refills at a rate of about 0.5 megabytes per second. This is a little bit of a subtlety to how to think about the large record feature and how it works, but we think this is a pretty seamless way to be able to use it.
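The token bucket behavior described above can be sketched in a few lines. This is a generic token bucket, not the service's implementation: the 10 MB capacity is an assumption chosen so that one maximum-size record fits in a single burst, the 1 MB per second refill rate comes from the talk, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Generic token bucket measured in megabytes (illustrative only)."""

    def __init__(self, capacity_mb: float, refill_mb_per_sec: float):
        self.capacity = capacity_mb
        self.refill_rate = refill_mb_per_sec
        self.tokens = capacity_mb  # start with a full bucket
        self.last_time = 0.0

    def try_write(self, size_mb: float, now: float) -> bool:
        """Refill for the elapsed time, then accept the write if enough tokens remain."""
        elapsed = now - self.last_time
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_time = now
        if size_mb <= self.tokens:
            self.tokens -= size_mb
            return True
        return False  # throttled: the caller has to back off and retry


bucket = TokenBucket(capacity_mb=10.0, refill_mb_per_sec=1.0)
print(bucket.try_write(10.0, now=0.0))   # large record uses the whole burst allowance
print(bucket.try_write(10.0, now=1.0))   # only 1 MB has refilled, so this is throttled
print(bucket.try_write(1.0, now=1.0))    # steady 1 MB/s writes still succeed
print(bucket.try_write(10.0, now=11.0))  # after backing off, the burst is available again
```

The sketch shows why a second large record right after the first is throttled while ordinary traffic at the sustained rate keeps flowing.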

Live Demo: Social Media Sentiment Analysis During a Viral Launch Event

One of the things that Ali and I thought was quite humorous when we were working together for this talk was that we found out we were both parents of twins. It's an interesting thing to be working on a horizontally scalable service such as Kinesis Data Streams and be a twin parent.

Being a twin parent, we've had to both learn how to scale ourselves horizontally, while our children have also scaled horizontally. I'd like to take you through a demo to stretch this analogy a little bit further and show you how we can use Kinesis Data Streams to handle many things going on at once, such as the chaos that happens in my household.

Let's imagine we're all building a new product and we want to do a launch. One of the things that is really good to do is to follow the social media sentiment of your launch. We've put together a social media sentiment analyzer. I'll remind you that this is a demo that you can actually check out for yourself. It's on GitHub, and I'll put a link to the demo in a bit.

The scenario is this: at 11 a.m., we're going to launch and make an announcement, and there'll be some amount of uptake. Then at 5 past the hour, a celebrity picks us up and reposts it. All of a sudden, the activity around this launch starts to mushroom. Eventually, we reach this viral moment where about 45 minutes after launch, we're seeing an insane amount of engagement and our social media analysis streaming application is really trying to keep up with all of this. Then eventually, the hype dies down and we're back down to a baseline of traffic. That's what we need to architect for.

Let's take a look at how we would do this. On the left-hand side here, we've got our event producer, which will be the data feed from the social network. The producer sends this data into Kinesis Data Streams. We then have a stream processing consumer here, which in this case we built using Lambda. This could be doing all sorts of things, like using a large language model to do the sentiment analysis for you. Then we store the results in a data lake for later use in analytics and for triggering further events. We'll use Data Firehose to put that data into Amazon S3 tables. This is phase one, after the initial launch.

We're seeing about 100 messages a second, which is nice and easy to handle. Nothing too bad. We're well scaled for this particular load. I'll just show you here the results of a ListShards API call. You can see we've got 4 shards open. Everything's going great. Now our celebrity picks us up. Their followers are alerted to this launch. You can see what happens is we see this pretty significant increase in engagement.

What's actually happening is that incoming records are being dropped on the floor because we're seeing more input from our producers than the stream is scaled to take. However, we are using on-demand, so the stream does scale up automatically. Eventually, the throttled records subside back down to baseline. You can see here that the original 4 shards that we had were split, and we created some new shards, which is great. This is all handling the scaling for you.

As we hit the viral moments, we split those shards even more. We end up with the peak of our sentiment records coming in. This particular workload is now scaled up to about 50,000 records per second coming in of social media posts that we need to analyze. By the end of all of this, we've automatically scaled the stream up to 128 active shards. But you can see all the shards that we've split there and closed.
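To reproduce the open versus closed view from a ListShards result, a sketch along these lines could be used. It relies on the documented shape of the response, where a closed shard carries an `EndingSequenceNumber` in its `SequenceNumberRange` and an open shard does not. The sample shard IDs and sequence numbers are made up for illustration, and `fetch_shards` needs boto3 and AWS credentials, so it is defined but not called here.

```python
from typing import Dict, List, Tuple


def split_open_closed(shards: List[Dict]) -> Tuple[List[str], List[str]]:
    """Separate shard IDs into open and closed, based on the ending sequence number."""
    open_ids, closed_ids = [], []
    for shard in shards:
        seq_range = shard.get("SequenceNumberRange", {})
        if "EndingSequenceNumber" in seq_range:
            closed_ids.append(shard["ShardId"])  # parent that was split or merged
        else:
            open_ids.append(shard["ShardId"])    # still accepting writes
    return open_ids, closed_ids


def fetch_shards(stream_name: str) -> List[Dict]:
    """Page through ListShards with boto3 (requires credentials; not called below)."""
    import boto3

    client = boto3.client("kinesis")
    shards: List[Dict] = []
    kwargs = {"StreamName": stream_name}
    while True:
        page = client.list_shards(**kwargs)
        shards.extend(page["Shards"])
        token = page.get("NextToken")
        if not token:
            return shards
        kwargs = {"NextToken": token}  # StreamName must be omitted when paging


# Hypothetical sample shaped like a ListShards response after one split.
sample = [
    {"ShardId": "shardId-000000000000",
     "SequenceNumberRange": {"StartingSequenceNumber": "100", "EndingSequenceNumber": "199"}},
    {"ShardId": "shardId-000000000001", "ParentShardId": "shardId-000000000000",
     "SequenceNumberRange": {"StartingSequenceNumber": "200"}},
    {"ShardId": "shardId-000000000002", "ParentShardId": "shardId-000000000000",
     "SequenceNumberRange": {"StartingSequenceNumber": "201"}},
]
open_ids, closed_ids = split_open_closed(sample)
print(len(open_ids), "open,", len(closed_ids), "closed")
```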

All told, we processed 55.1 million records, but unfortunately, we dropped 2 million on the floor. There was a bunch of data that we lost. Unfortunately, in this case, due to the initial scale not being enough, that's how our social media sentiment analysis application fared.

Now, if we were to do the same thing using On-Demand Advantage with our warm throughput, what we can do is we can go into the console and we can specify what we would like in terms of warm throughput for the stream. It's just a setting and you specify it in megabytes per second. This is a really neat tool here in the console where you can actually use either test data or even production data. You can get an idea of what you should be setting your warm throughput to.

So you can say, well, I expect this much load, what should I set it to? This helps you through that process, which is quite handy. So we start doing this with On-Demand Advantage. You can see we start out of the gate with 128 shards. Even though we've only got a very low amount of traffic at the beginning, we see our traffic spike up when the celebrity shares this on their network. As we get to the end of the result here, we were able to process 57.2 million records. We didn't drop anything on the floor. We got a much cleaner signal from the social media analysis. You can see that we didn't create any new shards. We were perfectly adequately scaled up to be able to handle this load.
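As a sanity check on the demo numbers, the arithmetic below combines the 128 shards and the 50,000 records per second peak with the per-shard write limits quoted earlier in the talk. It assumes those per-shard limits apply unchanged to a warmed-up stream and uses 1 MB = 1,000,000 bytes; the variable names are illustrative.

```python
# Per-shard write limits as quoted earlier in the talk.
SHARD_MB_PER_SEC = 1
SHARD_RECORDS_PER_SEC = 1_000

WARM_SHARDS = 128               # shard count shown in the demo
PEAK_RECORDS_PER_SEC = 50_000   # peak load shown in the demo

# Aggregate write capacity available from the first second of the launch.
warm_mb_per_sec = WARM_SHARDS * SHARD_MB_PER_SEC
warm_records_per_sec = WARM_SHARDS * SHARD_RECORDS_PER_SEC

# How far the record-rate capacity exceeds the observed peak.
record_headroom = warm_records_per_sec / PEAK_RECORDS_PER_SEC

# Largest average record size the byte limit would allow at that peak rate.
max_avg_record_bytes = warm_mb_per_sec * 1_000_000 / PEAK_RECORDS_PER_SEC

print(f"warm capacity: {warm_mb_per_sec} MB/s, {warm_records_per_sec} records/s")
print(f"record-rate headroom at peak: {record_headroom:.2f}x")
print(f"max average record size at peak: {max_avg_record_bytes:.0f} bytes")
```

Under these assumptions the stream has a bit more than twice the record-rate capacity of the peak, which is consistent with nothing being throttled in the second run.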

I think that was a really interesting demonstration of how warm throughput can help you with business events that you can anticipate, such as a launch. As I mentioned, this is available on GitHub. If you're interested, you can take a snap there of the QR code. You can check it out and run this for yourself. It's a really interesting example of how On-Demand Advantage can make your peak events a non-issue.

Under the Hood: Innovative Shard Splitting Architecture and High Availability Design

I'm going to hand over to Ali to take you through the rest of the under the hood details. Thank you very much.

Awesome. John mentioned a lot of amazing features about Kinesis Data Streams. As John mentioned, I'm a streaming architect with AWS and I work with customers every day. Very often I help them set up, configure, and fine tune their self-managed streaming storage according to best practices so they can scale when they have a large event. Sometimes they need guidance, and then we help them scale their infrastructure.

Very often, scaling from 1 gigabyte per second to 10 gigabytes per second needs weeks of planning. It became a point of curiosity for me how a Kinesis stream can do that without operators doing anything. That's very interesting. So I had a conversation with engineers. I reached out and with John's permission, I got connected with some of our amazing folks and I asked, can you walk me through how you designed this system and how this is different? They shared a lot of under the hood facts with me, and I'm here today to share all those under the hood details with you for the first time.

Just a refresher on John's part: the Kinesis Data Streams principles are that it should be easy to use, highly available, secure, and durable, with low latency at the very largest scale, meaning millions of data streams and hundreds of millions of requests per second. Engineers shared with me that one of the first difficult decisions they needed to make was how the system should scale. John mentioned that a shard is a unit of scale, and then we horizontally scale up, right? Easy, isn't it?

Well, one implementation, which I know other streaming storage systems also use, is to simply add shards, right? So we have shard one, and then we just add another shard. How many problems are you able to identify with this? One, two, three. I identified three. For one, as I'm showing you on this slide, the messages could get out of order, even though they all have the same key. Why is that?

If you were paying attention when we introduced the shard, there was a part showing that the partition key is run through a hash function. The hash space is divided into ranges, and each range maps to a shard, so the hash of a message's key determines which shard the message is sent to. That's how the system distributes messages, right? What happens when we go from one shard to two? Well, the hash range splits, and messages that all belonged to one shard's range could now belong to another shard's range. So message C with the same key could end up in a different shard, and the consumer could read the messages out of order.
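The ordering problem with naively adding a shard can be reproduced with a toy model. Everything here is hypothetical: a hash space of 100 instead of 128 bits, a fixed hash value of 70 for the shared key, and a consumer that happens to poll the newest shard first.

```python
from collections import defaultdict


def route_naive(key_hash: int, shard_count: int, hash_space: int = 100) -> int:
    """Naive routing: re-divide the hash space evenly across the current shard count."""
    width = hash_space // shard_count
    return min(key_hash // width, shard_count - 1)


log = defaultdict(list)
KEY_HASH = 70  # hypothetical hash shared by messages A, B, and C

# With one shard, A and B land on shard 0.
for message in ("A", "B"):
    log[route_naive(KEY_HASH, shard_count=1)].append(message)

# A second shard is added in place; the same key now maps to shard 1.
log[route_naive(KEY_HASH, shard_count=2)].append("C")

# A consumer with no parent/child relationship to follow may poll shards in any order.
read_order = [m for shard in sorted(log, reverse=True) for m in log[shard]]
print(read_order)  # C is seen before A and B
```

Nothing tells the consumer that shard 0 has to be drained before shard 1, which is the gap the parent and child design closes.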

Problem number two is, are we supposed to scale back in? Can we just go from two shards to one now? No. What are we going to do with the data? Are we going to just read that data somehow?

We cannot simply read that data and merge it into another shard. There is another problem. When we want to balance the load on the back-end nodes, which I will discuss later, we end up with shards containing so much data that a shard could easily reach 10 terabytes or more. How are we going to balance this load across different worker nodes? We would need to copy that data from one node to another and then to another. This takes a lot of resources, consumes a lot of time, and violates the principle of being predictable.

Engineers were very innovative in their approach to how the shard actually scales. The shard does not simply add another shard. Instead, it splits the shards. What this means is that the active shard, also called the parent, will close. Two new shards will be added, also called children. The consumer first reads the messages from the parent, and then it reads messages from the children.

When we take a screenshot and describe the shards of a stream, this is how it looks. The parent shard is closed, and two new shards have been opened. Similarly, when we want to scale back in, the two parent shards close and a single child shard is added. All new messages from the producer go to the child shard, while the parent shards are closed. The consumer can consume the messages from the parent shards until they are finished and then move on to the child. Then we can split again and merge again.
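The parent and child relationship for splits and merges can be modeled with a small in-memory sketch. This is a toy, not the service's implementation: the `Stream` and `Shard` classes, the hash space of 100, and the rule "drain a shard only after all of its parents are drained" are assumptions that mirror the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Shard:
    shard_id: str
    start: int                      # inclusive hash range
    end: int
    parents: Tuple[str, ...] = ()
    is_open: bool = True
    records: List[str] = field(default_factory=list)


class Stream:
    def __init__(self, hash_space: int = 100):
        self.shards: Dict[str, Shard] = {}
        self._next = 0
        self._add(0, hash_space - 1, ())

    def _add(self, start: int, end: int, parents: Tuple[str, ...]) -> None:
        shard = Shard(f"shard-{self._next}", start, end, parents)
        self._next += 1
        self.shards[shard.shard_id] = shard

    def put(self, key_hash: int, payload: str) -> str:
        """Write to the single open shard whose range covers the key's hash."""
        for shard in self.shards.values():
            if shard.is_open and shard.start <= key_hash <= shard.end:
                shard.records.append(payload)
                return shard.shard_id
        raise ValueError("no open shard covers this hash")

    def split(self, shard_id: str) -> None:
        """Close the parent and open two children covering its halves."""
        parent = self.shards[shard_id]
        parent.is_open = False
        mid = (parent.start + parent.end) // 2
        self._add(parent.start, mid, (shard_id,))
        self._add(mid + 1, parent.end, (shard_id,))

    def merge(self, left_id: str, right_id: str) -> None:
        """Close two adjacent parents and open one child covering both ranges."""
        left, right = self.shards[left_id], self.shards[right_id]
        left.is_open = right.is_open = False
        self._add(left.start, right.end, (left_id, right_id))

    def read_all(self) -> List[str]:
        """Drain each shard only after all of its parents have been drained."""
        done, out = set(), []
        while len(done) < len(self.shards):
            for shard in self.shards.values():
                if shard.shard_id not in done and all(p in done for p in shard.parents):
                    out.extend(shard.records)
                    done.add(shard.shard_id)
        return out


stream = Stream()
stream.put(70, "A")
stream.put(70, "B")
stream.split("shard-0")             # shard-1 covers 0-49, shard-2 covers 50-99
stream.put(70, "C")                 # same key, now written to a child
stream.merge("shard-1", "shard-2")  # scale back in: shard-3 covers 0-99
stream.put(70, "D")
print(stream.read_all())
```

Because closed shards are never written to again and a child is read only after its parents, messages for one key stay in order through both the split and the merge, and no data has to be copied between shards.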

One good thing about this architecture is that it solves the problem of messages getting out of order. With this approach, we can scale back in without any data loss and without any problem with message ordering. Additionally, each of these shards could be in a different worker node. If we have a shard that does not split or merge and the size is reaching some limit, we can decide to close it and move the rest of it to open a new shard on another worker. This way, we do not need to copy the data everywhere because we are doing even load balancing on the back end.

Here is what happens. The workers are the storage nodes that persist your data. Because the workers are in three availability zones, you see three columns for them. Once a shard closes, a new shard can open on a different worker with replicas on other workers, and the data will go there. The previous shard is still there, but it is closed. The producer cannot write data to it, and the new workers accept the writes. The reader will first read the parent shard and then move on to the child shards.

Another principle that was mentioned is that Kinesis Data Streams must be highly available. High availability means that for any data that you write to Kinesis Data Streams, the service actually stores three copies of it and puts them in three different availability zones. Why? Because each availability zone is a containment zone for failures. In the event that we lose an entire availability zone, two copies are still available.

With the previous architecture that I mentioned and also the fact that the nodes are replicated across three availability zones, there is some complexity. One of these nodes is always going to be the leader, the node that accepts the write, and then two other nodes are going to be replicas. When a consumer and a producer want to read or write, they need to connect to that node. They need to find a way to connect to it. Therefore, they need to do a discovery. They need to find out how to connect, what the DNS record is, what the IP address is, and how to connect to it, and then they make a connection. If the replica or a leader fails and then another leader takes over, they need to also know that and then can make a new connection. This violates another principle that Kinesis Data Streams needs to be easy to use.

Therefore, engineers built the entire front-end fleet and abstracted all that complexity away from customers by exposing a single API. Through a single API, you would be able to collect data across various sources and bring that into your network infrastructure. So you don't have to manage network infrastructure, you don't have to worry about the scale of that infrastructure, and you don't have to worry about any DDoS attack or any bad action that could happen for the APIs that you need to build. That API is available and takes care of auto-balancing and load-balancing data across multiple availability zones.

The Front-End API Layer: Cache Service, Membership Service, and Timer Service

Now that it is the responsibility of the Kinesis data stream to figure out how the request coming in from a customer gets routed to the back-end nodes, engineers had to build three internal services that are hidden from you: a cache service, a membership service, and a timer service. We'll talk more about those. There are some customer responsibilities as well. The consumer and producer responsibility is to be aware of the shards, to know that each Kinesis data stream has shards underneath, and then for a producer to load-balance the data across many shards without exceeding the limits of the shards. For the consumer, the responsibility is to be aware of the processing threads, which one is reading from which shard and doing some sort of a process that we call lease management, but we'll talk about that more.

Kinesis Data Streams continuously monitors all aspects of the back-end nodes. With other streaming storage systems, adding another consumer means doing capacity planning all over again, because you need to make sure there is enough network and disk throughput to sustain the incoming traffic, keep a buffer for all the background operations, and still have enough capacity for an additional consumer to read the same data. With Kinesis Data Streams, you don't need that, because Kinesis Data Streams has a feature called enhanced fan-out that recently launched to support up to 50 distinct consumer groups off of a single Kinesis data stream.

Let's talk about that API layer a little bit more. For those of you who raised your hand and said you're using Kinesis data stream, thank you for that. But the chances are the data that you're writing today is stored in the same node as somebody else in this room today. So the API needs to make sure every request that is coming in is authenticated and the user who's trying to access has access to the resource that is indicated. It also needs to make sure that it's not exceeding the quota and overdriving those back-end nodes, and it needs to make sure that the data is encrypted with encryption in transit or TLS and also encrypted with the customer-managed key or with the AWS key on the back end.

If you have a requirement for using IPv6, a single API as part of a dual-stack API will give you IPv6 or IPv4 through the AWS SDK, and you can enable that. If your requirement requires you to use FIPS endpoints in the US government cloud, you can access Kinesis data stream with FIPS endpoints, so you don't have to build that. If you want to do access control, you have access to resource-based policies for controlling the access to the resources at the stream level, and you also have access policies through IAM, AWS Identity and Access Management.

Each of those front-end services does this hundreds of millions of times every second. Let's see how. These APIs need information about which member wants to access which resource, and which shard belongs to which stream and which member. That information also indicates which node is the leader and which is the replica, and that's how routing happens when a request comes in. A naive approach is to store that information in a database and, for every incoming request, load it using the shard ID or the Kinesis stream ID, find the node the request should be routed to, and route it there.

However, we know this design doesn't work. Why is that? Because if we are building a data stream for massive scale, then we need a database service even bigger than that, because we need to make sure that the service is reliable. Also, the latency wouldn't be very low, because databases tend to have double-digit millisecond latency, and the API latency would be even higher than that. So, like any other engineer would say, let's put a cache in front of it.

Would this work? Well, it would definitely improve performance, but not always. What if the cache fails? Then we need a cache service with capacity even bigger than the data stream, and the cache could also get out of sync with the database for a variety of reasons. So engineers went back to the basics of how a database actually works. If you think about it, what a database does for us is give us a consistent view, or a snapshot, of all the change logs. For all the change logs that come through, it always gives you the latest and accurate snapshot at the time the read occurs.

So engineers thought, what if we just replicate that to all the different components, the front-end host and also the cache layer? Then each of these components independently can build that snapshot, so when the request comes in, we can just read that information from the cache in the memory of the worker. Does this design work? Not quite. We're getting there, but not quite.
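The idea of every component rebuilding the same snapshot from a replicated change log can be sketched roughly as follows. This is only an illustration of the concept, not the Kinesis implementation: the `Change` record, its fields, and the `apply_changes` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """Hypothetical change-log entry: shard_id is now led by `leader` (None = shard removed)."""
    shard_id: str
    leader: Optional[str]


def apply_changes(snapshot: Dict[str, str], changes: Iterable[Change]) -> Dict[str, str]:
    """Replay change-log entries in order to get the latest shard-to-leader snapshot."""
    result = dict(snapshot)  # copy so the caller's snapshot is not mutated
    for change in changes:
        if change.leader is None:
            result.pop(change.shard_id, None)
        else:
            result[change.shard_id] = change.leader
    return result


log = [
    Change("shard-1", "node-a"),
    Change("shard-2", "node-b"),
    Change("shard-1", "node-c"),  # leadership of shard-1 moves
]

# A front-end host and a cache node replay the same log independently
front_end_view = apply_changes({}, log)
cache_view = apply_changes({}, log)
```

As long as both components see the same entries in the same order, their in-memory views agree without either of them querying the database on the request path.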

The problem with that design is that we are operating a distributed system, and a distributed system can go out of sync for many reasons. A node could fail, or a change log entry could be processed twice or processed out of order, and the system would get out of sync. For that reason, engineers added information to the shard map entity to indicate how long each record is valid. Whenever a node uses a record, it also checks whether this time has passed. If it has, the record is stale, so the node needs to go and load the new record.

Does this design work now? There is still one more wrinkle that we need to iron out. Time runs differently across the components of a distributed system, so when they compare times, they should compare against a reliable clock. What is the reliable clock? Is the system time at the different components reliable enough? No, milliseconds matter here. We're talking about hundreds of millions of requests every second. If a request goes to the wrong node, you get an error and have to retry, and that's not a reliable system.

Therefore, engineers built another service called the Timer Service. The Timer Service is the source of truth for time across all these different components. It doesn't matter what time their own system clocks show; they need to compare the time-to-live against the Timer Service. They constantly poll the Timer Service to refresh their view of the time, and then they use that to compare with the time-to-live. If a record is stale, they go get it from the cache. If it's not in the cache, they go get it from the database. Does this design work? Yes, this works.
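The lookup path described here (in-memory record, then cache, then database, with staleness judged against a shared clock) can be sketched like this. It is a simplified illustration under stated assumptions: `ShardRouter`, `ShardMapEntry`, the `timer_now` callable standing in for the Timer Service, and the 5-second TTL are all hypothetical, and treating a stale cache entry as a cache miss is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ShardMapEntry:
    leader: str
    valid_until: float  # expiry in timer-service time, not local system time


class ShardRouter:
    """Hypothetical front-end lookup: local memory, then cache, then database."""

    def __init__(
        self,
        timer_now: Callable[[], float],
        cache: Dict[str, ShardMapEntry],
        database: Dict[str, str],
        ttl_seconds: float = 5.0,
    ) -> None:
        self._timer_now = timer_now  # stand-in for polling the Timer Service
        self._cache = cache
        self._database = database
        self._ttl = ttl_seconds
        self._local: Dict[str, ShardMapEntry] = {}

    def _fresh(self, entry: Optional[ShardMapEntry]) -> bool:
        return entry is not None and self._timer_now() < entry.valid_until

    def leader_for(self, shard_id: str) -> str:
        entry = self._local.get(shard_id)
        if self._fresh(entry):
            return entry.leader
        entry = self._cache.get(shard_id)
        if not self._fresh(entry):
            # Cache miss (or stale cache entry): fall back to the database
            entry = ShardMapEntry(self._database[shard_id], self._timer_now() + self._ttl)
            self._cache[shard_id] = entry
        self._local[shard_id] = entry
        return entry.leader


clock = {"now": 100.0}
database = {"shard-1": "node-a"}
cache: Dict[str, ShardMapEntry] = {}
router = ShardRouter(lambda: clock["now"], cache, database)

first = router.leader_for("shard-1")   # loaded from the database
database["shard-1"] = "node-b"         # leadership changes behind the scenes
second = router.leader_for("shard-1")  # still within the TTL, served from memory
clock["now"] += 6.0                    # timer-service time moves past the TTL
third = router.leader_for("shard-1")   # stale record is reloaded
```

Because every component compares `valid_until` against the same clock source rather than its own system time, two hosts cannot disagree about whether a record has expired.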

Consumer Best Practices: Kinesis Client Library and Lease Management

Now let's talk about the consumer responsibilities. As I mentioned, consumers need to be aware of the shards because we are processing the data in a distributed way as well. There are certain aspects that the consumer needs to manage. One is the state of which processing thread, which worker is consuming from which shard. The mapping between the shard and the consumer's instances or consumer processing threads needs to be kept somewhere and then that needs to be constantly monitored and managed. So there are some best practices for consumers. If you're writing your own consumer application from scratch and you want to write your own client library, make sure that you do proper handling and make sure that you monitor your clients and you are doing a good job of keeping track of the state.

If you're using any of these native services or third-party applications, there is already integration between Kinesis Data Streams and these applications. Because of that native integration, you don't have to write any code. As soon as the data lands in the Kinesis Data Streams, you can consume it using any of these services or third parties.

If you need to write code and do not want to use AWS Lambda, the Kinesis Client Library takes care of all the heavy lifting and state management. Unless you're really interested in doing this yourself, packaging the Kinesis Client Library with your code is the easiest way to write a custom consumer for a Kinesis data stream.

Recently, the Kinesis team released version 3 of the Kinesis Client Library. It has many enhancements. One of them, my favorite, is that it detects which of the processing threads are overloaded and then evenly balances the load. It basically takes a lease from one worker and gives it to another one. Balancing the load evenly across workers translates into cost benefits for you.

When you use the Kinesis Client Library, it manages the state across three separate DynamoDB tables. The state table keeps track of which worker is the leader. The lease table keeps track of which worker and processing thread is consuming from which shard, and what that subscription and mapping looks like. Then there is a metric table that Kinesis Client Library version 3 uses to know which worker is overloaded and then evenly balance the load. It also emits all the client-related metrics to Amazon CloudWatch for you, so you don't have to worry about that integration either.
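To make the "take a lease from one worker and give it to another" idea concrete, here is a toy rebalancing step. It is not the algorithm the Kinesis Client Library uses: the `rebalance_once` function, the utilization numbers, and the 0.2 threshold are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple


def rebalance_once(
    leases: Dict[str, List[str]],
    utilization: Dict[str, float],
    threshold: float = 0.2,
) -> Optional[Tuple[str, str, str]]:
    """Move one lease from the busiest worker to the least busy one.

    Returns (shard_id, from_worker, to_worker), or None when the workers are
    already within `threshold` of each other or the busiest worker has only
    one lease left to give.
    """
    busiest = max(utilization, key=utilization.get)
    idlest = min(utilization, key=utilization.get)
    if utilization[busiest] - utilization[idlest] <= threshold:
        return None
    if len(leases[busiest]) <= 1:
        return None  # handing over the last lease would just move the hotspot
    shard_id = leases[busiest].pop()
    leases[idlest].append(shard_id)
    return shard_id, busiest, idlest


leases = {"worker-1": ["shard-1", "shard-2", "shard-3"], "worker-2": ["shard-4"]}
utilization = {"worker-1": 0.9, "worker-2": 0.3}  # e.g. a CPU fraction read from the metric table
move = rebalance_once(leases, utilization)
```

In this sketch a periodic task would call `rebalance_once` with fresh utilization figures until it returns `None`.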

On the left-hand side is the worker, which is in charge of polling. On the right-hand side are the processing threads, and these threads are decoupled from each other. Those who have done multi-threaded programming know how difficult it is to write multi-threaded programs without memory leaks and other issues. There is also a scheduler, which acts like a timer and runs everything.

Here is what a lease management record looks like. It has a key and a checkpoint. As each processing thread progresses, it frequently updates the checkpoint, and the checkpoint information is stored in the lease management table. The record also has a lease expiration time, similar to a time-to-live, and it serves a crucial purpose. When a lease record is not updated, meaning that this timestamp is not up to date, it means that the processing thread has failed silently. The system detects it and another worker takes over. The first worker to take the lease wins and takes over from the failed worker.

When the leader fails, another worker basically will take over. If a worker fails, the leader could take a lease from the previous worker and give it to another worker. That's how a distributed system and a stateful distributed compute system should work, and it works with the Kinesis client library.
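The lease record and takeover behavior described above can be sketched with an in-memory table. This is a simplified model rather than the library's actual record layout or protocol: the real library keeps leases in DynamoDB, and the `Lease` and `LeaseTable` names, the 10-second lease duration, and the `"TRIM_HORIZON"` starting checkpoint are placeholders for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lease:
    shard_id: str      # lease key
    owner: str         # worker currently holding the lease
    checkpoint: str    # last position the owner reported as processed
    expires_at: float  # pushed forward on every renewal


class LeaseTable:
    """In-memory stand-in for the lease table."""

    def __init__(self, lease_duration: float = 10.0) -> None:
        self.lease_duration = lease_duration
        self.leases: Dict[str, Lease] = {}

    def acquire(self, shard_id: str, worker: str, now: float) -> bool:
        """Take a lease that does not exist yet or whose owner stopped renewing it."""
        lease = self.leases.get(shard_id)
        if lease is not None and now < lease.expires_at:
            return False  # the current owner is still alive
        # Keep the previous checkpoint so the new owner resumes where the old one stopped
        checkpoint = lease.checkpoint if lease is not None else "TRIM_HORIZON"
        self.leases[shard_id] = Lease(shard_id, worker, checkpoint, now + self.lease_duration)
        return True

    def renew(self, shard_id: str, worker: str, checkpoint: str, now: float) -> bool:
        """Owner heartbeat: record progress and extend the expiration time."""
        lease = self.leases.get(shard_id)
        if lease is None or lease.owner != worker:
            return False  # the lease was taken over in the meantime
        lease.checkpoint = checkpoint
        lease.expires_at = now + self.lease_duration
        return True


table = LeaseTable(lease_duration=10.0)
table.acquire("shard-1", "worker-a", now=0.0)
table.renew("shard-1", "worker-a", "seq-100", now=5.0)           # lease now expires at t=15
too_early = table.acquire("shard-1", "worker-b", now=12.0)       # owner still considered alive
took_over = table.acquire("shard-1", "worker-b", now=20.0)       # worker-a went silent
late_renew = table.renew("shard-1", "worker-a", "seq-101", now=21.0)
```

A production system would need the read-check-write in `acquire` to be atomic so that only one of several competing workers wins; this single-threaded sketch does not model that.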

Producer Best Practices: Batching, Aggregation, and the Kinesis Producer Library

Now let's talk about the producers. Producers also need to be aware of shards in a different way. Is it efficient if a producer makes an API call for every single record? When I speak with many customers, they acknowledge that it's not efficient, but they do it anyway because they think they get lower latency this way. It turns out network calls are not cheap and not free. Network calls have overhead. Every single request needs to be authenticated, checked for access control, checked against access limits, and then routed through. If you're sending one request for every single record, and by the way that record is very small, you're not using it efficiently.

We recommend customers batch up the records into larger payloads and then, if it is possible, implement asynchronous operations.

Kinesis Data Streams has two APIs for writing records: PutRecord and PutRecords. We recommend using the PutRecords API whenever possible in order to send multiple records with a single network call. With PutRecords, you can put up to 500 records in one call. What should the size of each record be? We'll talk about that more.
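A minimal sketch of batching with PutRecords might look like the following. The `put_in_batches` and `chunked` helpers are hypothetical; `client` is assumed to be an object exposing `put_records(StreamName=..., Records=[...])` in the shape of the boto3 Kinesis client, and the stream name is whatever you pass in.

```python
from typing import Any, Iterator, List, Tuple

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts up to 500 records per call

Record = Tuple[str, bytes]  # (partition key, data)


def chunked(records: List[Record], size: int) -> Iterator[List[Record]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def put_in_batches(client: Any, stream_name: str, records: List[Record]) -> List[Record]:
    """Send records with as few PutRecords calls as possible.

    Returns the records the service rejected so the caller can retry them.
    """
    failed: List[Record] = []
    for batch in chunked(records, MAX_RECORDS_PER_CALL):
        response = client.put_records(
            StreamName=stream_name,
            Records=[{"PartitionKey": key, "Data": data} for key, data in batch],
        )
        # Results come back in request order; a rejected entry carries an ErrorCode
        for record, result in zip(batch, response["Records"]):
            if "ErrorCode" in result:
                failed.append(record)
    return failed
```

With boto3, `client` would be created with `boto3.client("kinesis")`. Note that a PutRecords call can succeed overall while individual records fail, which is why the sketch inspects each result instead of relying on the call not raising.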

If you're using native services or third-party services with native integration, there are native connectors that will help you write data to Kinesis Data Streams, and those best practices are already built in. But if you're writing your own code and calling the APIs yourself, make sure that you also do aggregation if you have very small records. We recommend a record size of around 10 kilobytes. If your records are much smaller than that, one technique is to build larger payloads out of smaller records.

Let's say we have clickstream records that are only 400 bytes each. We can aggregate them into a 10 kilobyte batch, and we can even compress the batch, so that we pack in more clickstream records until the compressed payload is around 10 kilobytes. Then we can use the PutRecords API to send many of those payloads in one network call. Although it is counterintuitive, you will get better latency and better throughput this way, because if you have a very large throughput of small clickstream records, the network overhead becomes the bottleneck for moving all that throughput through the pipeline.
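The aggregate-then-compress technique can be sketched as below. This is not the Kinesis Producer Library's aggregation format; it is an assumed scheme (newline-delimited JSON compressed with zlib) with hypothetical helper names, and the consumer has to apply the matching `disaggregate` step.

```python
import json
import zlib
from typing import Dict, Iterable, Iterator, List

TARGET_BYTES = 10 * 1024  # the roughly 10 KB record size recommended in the talk


def aggregate(events: Iterable[Dict], target_bytes: int = TARGET_BYTES) -> Iterator[bytes]:
    """Pack small events into compressed payloads of at most target_bytes each.

    The buffer is recompressed after every event to measure its size, which is
    simple but quadratic; it is fine for a sketch, not for a hot path.
    """
    lines: List[bytes] = []
    payload = b""
    for event in events:
        line = json.dumps(event, separators=(",", ":")).encode("utf-8")
        candidate = zlib.compress(b"\n".join(lines + [line]))
        if lines and len(candidate) > target_bytes:
            yield payload  # emit the last payload that still fit
            lines = [line]
            payload = zlib.compress(line)
        else:
            lines.append(line)
            payload = candidate
    if lines:
        yield payload


def disaggregate(payload: bytes) -> List[Dict]:
    """Consumer side: recover the original events from one payload."""
    return [json.loads(line) for line in zlib.decompress(payload).split(b"\n")]


clicks = [{"user": i, "page": "/item/%d" % (i % 50), "action": "click"} for i in range(300)]
payloads = list(aggregate(clicks))
restored = [event for payload in payloads for event in disaggregate(payload)]
```

Each payload then becomes the `Data` of a single Kinesis record, so one PutRecords call can carry many payloads and therefore far more clickstream events than it has records.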

If you do not want to implement these best practices yourself and want to ensure they are always followed correctly, you can use the Kinesis Producer Library. That's another library which is open source, and I'll provide the link in a moment. You can use it to aggregate all the smaller records into larger batches and even compress them. There are options that you can enable or disable. Once the records are in larger batches, the library holds them in a buffer, and the buffer is reliable. If a node fails, the system ensures that the records are there.

The collector and the retrier and limiter will help you when you hit throttling. It retries to make sure that the buffer gets flushed into the Kinesis Data Stream. The limiter ensures that you're not exceeding the quotas and limits that the Kinesis Data Stream imposes. It emits metrics in terms of how many records get throttled, how many records went through, how many records were successful, and what the latency of the pipeline is into Amazon CloudWatch. It can reliably work across many instances of your producer.
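One common way to build a client-side limiter of this kind is a token bucket. The sketch below is an assumption about how such a limiter could work, not the Kinesis Producer Library's implementation; the `TokenBucket` class is hypothetical, and the 1,000 records per second figure is the per-shard write limit used here only as an example parameter.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `rate` units per second, with bursts of up to `capacity` units."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def try_acquire(self, amount: float = 1.0) -> bool:
        """Consume `amount` tokens if available; otherwise leave the bucket untouched."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if amount <= self._tokens:
            self._tokens -= amount
            return True
        return False


# One bucket per shard: 1,000 records per second, counted in records
shard_limiter = TokenBucket(rate=1000, capacity=1000)
can_send_batch = shard_limiter.try_acquire(500)
```

When `try_acquire` returns `False`, a producer would keep the batch in its buffer and try again shortly, which avoids sending requests that the service would throttle anyway.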

Additional Resources and Call to Action

I will leave you with additional resources here. If you want to get more familiar with Kinesis Data Streams and want more details, you can go to the first QR code. The second one is the demo that we shared with you today, which will take you to the link to the GitHub to try out the demo. If you want more information in terms of reading the code or finding the project of the Kinesis Producer Library or Kinesis Client Library, you can go to those two QR codes on the right side that will take you to the GitHub, and you can check those out.

What I would like you to do after this session is try out our demo and see how Kinesis Data Streams can deal with the peak of going from 10,000 to 100,000 or 500,000 or even millions of records per second. Monitor the whole pipeline in a live demo in your environment that comes with a lot of the best practices we shared today. You can reuse a lot of those components if you're writing a similar application or want to build a serverless pipeline that scales out by itself using that as a baseline for your projects.

Another call to action is please fill out the survey that you will receive a link for because those surveys are really important for us. We use your feedback in order to improve the content and improve these sessions for the future. Those surveys are always important for us. With that, I would like to thank you all for coming today, staying late with us tonight, and sitting through this presentation. Thank you.

; This article is entirely auto-generated using Amazon Bedrock.
