🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Disagree and Commit: The Performance Improvements That Cut Costs by a Third - OPN309
In this video, Madelyn Olson from Amazon ElastiCache and Corey Quinn discuss how AWS's journey with open source led to price reductions. They explain ElastiCache's architecture running open source engines like Redis, Memcached, and Valkey, detailing technical challenges like implementing TLS encryption that initially reduced throughput from 130,000 to 80,000 requests per second. After Redis changed to restrictive licensing in March 2024, the community forked it as Valkey under BSD license with Linux Foundation governance. Valkey 8 achieved significant improvements: scaling from 0 to 5 million requests per second in 15 minutes versus 70 minutes previously, and delivering 41% memory savings through Ericsson's hash table optimization. These technical wins translated to AWS price cuts—ElastiCache Serverless became 33% cheaper and node-based deployments 20% cheaper than Redis equivalents. Ironically, Redis now pulls features from Valkey, including the slot statistics AWS originally proposed. The presentation demonstrates how open source collaboration benefits everyone, with Quinn showcasing a Valkey-powered AI tool costing just 54 cents monthly.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: How Open Source Contributions Led to Lower Pricing at Amazon ElastiCache
Alright everyone, welcome to Disagree and Commit. This is going to be a story about how an AWS service learned that it was so great to contribute to open source for our community and our customers that we decided to lower pricing because of it. My name is Madelyn Olson. I'm a Principal Engineer on the Amazon ElastiCache team, and I'll be showing you a little bit behind the curtains of my service ElastiCache and our journey through open source. I'm joined by the illustrious Corey Quinn, everyone's favorite AWS critic, as he provides a fair and balanced economic perspective.
That's a super polite way to say that Madelyn is very smart and is going to talk about the intricacies of the service under the hood while I stand next to her on stage and yell jokes from time to time. As for programming languages, I'm only very good at two of them: brute force combined with enthusiasm. But now that you can cyberbully a robot, it moves mountains. I've been using Valkey and its downstream proprietary fork Redis all the way since 2011, so I have opinions about a lot of this.
Many people warned me about Corey Quinn, but I think he's mostly harmless. So without further ado, let's get started. I'm going to talk a little bit about Amazon ElastiCache and what it looks like behind the scenes, and the types of services and coding and projects that we use to help build the service. Honestly, learning about this was my whole reason for wanting to do this talk, because I always wondered what actually happens under the hood. Well, this is a good way to find out.
So I'm going to talk a little bit about what Amazon ElastiCache looked like when I first joined the service. You as a customer will come to our service with a VPC and we're going to provide a caching endpoint for you. Amazon ElastiCache is the managed caching service that provides a bunch of different underlying engines. So the service is ElastiCache, but we provide endpoints compatible with Redis, Memcached, and soon to be Valkey.
A lot of people think we have a very complex system under the hood, but it's actually very simple. When you go to Amazon ElastiCache and try to create a cache cluster, we will create a VPC dedicated just for you. We will create some EC2 instances, which we call cache nodes, and install some management processes on them. These are responsible for health checks, making sure the disks are okay, making sure parameters are getting updated, all of that boring undifferentiated heavy lifting that we love to talk about.
And then we install the cache engine. Yeah, one of the things I always like to say is that every time data passes between Availability Zones, a product manager gets their wings, or at least a small yacht to go with the rest of their yachts, except when it comes to things like ElastiCache, where replication traffic is not metered or charged, which in many cases is a terrific way to get around some of that cross-AZ data transfer. Why it's like that, only the gods know, but it is convenient and we can take advantage of it.
Then we attach an ENI in your VPC, which is where your application sends its traffic to the cluster. What's important here is that these cache engines are, for the most part, just open source engines. As I said, we support open source Memcached. There are actually maybe 100 lines of code that differentiate our Memcached from open source Memcached. Memcached is under a permissive open source license: we're allowed to modify the code, and when we find bugs, we report them back.
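To make the customer-side view concrete, here is a minimal sketch using boto3; the cluster name, node type, and engine value are illustrative placeholders rather than anything from the talk. One control-plane call is what kicks off the dedicated VPC, cache nodes, management processes, and engine install described above.

```python
import boto3

elasticache = boto3.client("elasticache")

# One API call; ElastiCache provisions the dedicated VPC, the EC2 cache
# nodes, the management processes, and the open source engine behind it.
elasticache.create_replication_group(
    ReplicationGroupId="demo-cache",                       # placeholder name
    ReplicationGroupDescription="Demo cluster on an open source engine",
    Engine="valkey",                                       # or "redis"
    CacheNodeType="cache.r7g.large",                       # illustrative size
    NumCacheClusters=2,                                    # primary + replica
    TransitEncryptionEnabled=True,                         # the TLS story below
)
```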
The TLS Challenge: Building Encryption Support and the Vicious Cycle of Private Forks
The same is also true for Redis. The core infrastructure of our service was based on Redis open source back in the day, but we did have to make some custom changes. So I'm going to talk you through how we implemented one of those changes. One thing that our customers demanded from us was TLS support. Well really, you had your CTO prancing around on stage with a shirt that said encrypt everything, and I actually had one made one year that added a parenthetical underneath that said unless it's hard, but apparently that's not the way it's supposed to go.
So yeah, we did demand TLS. Encryption is a best practice. It is a best practice, and our customers needed it for things like compliance, as well as their own internal security teams demanding it. Customers wanted to come to us and have all their traffic encrypted. The best practice at the time, this was about 2018, since neither Redis open source nor Memcached open source supported TLS, was to use a TLS proxy. So all of your traffic is sent first to the TLS proxy, it's decrypted, and then it's sent in plain text on to the underlying caching engine.
Yeah, there's always the question, if it's not end-to-end encrypted, where does TLS terminate? If I'm the one setting up the infrastructure, it generally winds up on the floor, which is why it was nice to be able to offload that undifferentiated heavy lifting to people who are good at infrastructure. Exactly. We were running these TLS proxies on the hosts themselves, and that was a good way to make sure that the actual traffic between the cache hosts and the TLS proxy was encrypted.
We had all the infrastructure for this, we had our control plane, we had our management processes, but there was a problem. The problem was that it was very expensive and slow from a CPU perspective, to the point where it almost made you want to not use TLS at all. In the short term, that is less expensive. In the long term, it is the exact opposite. But that's okay. That's why CISOs are best known to be ablative: we have to get rid of one because we had a breach, and we'll replace them with a new one. Sometimes you get away with blaming interns for that, but that's harder these days as people are waking up to the fact that maybe the intern didn't make the strategic decision; now the LLM did.
So, the base version of Redis could do about 130,000 requests per second, fully utilizing its single-threaded architecture on a core. When we tried to use TLS with the proxy, we were only able to get about 80,000 requests per second, and we were primarily bottlenecked on that TLS proxy. That intuitively makes sense. Redis is primarily a hash map attached to a TCP server, and all a TLS proxy really is is two TCP servers talking to each other and doing TLS in the middle, so it's actually doing more work than the underlying engine. We did some prototyping and figured out that directly embedding TLS into Redis itself was significantly more performant. We were able to get better throughput with less resource consumption, and this is what we want to do for customers at AWS: build things that are best for them and that make the most of our hardware.
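From the client's point of view, in-engine TLS is just a TLS connection straight to the cache endpoint, with no sidecar proxy in the path. A minimal sketch with redis-py (the endpoint below is a placeholder; the same client also talks to Valkey, since the protocol is unchanged):

```python
import redis

# TLS terminates inside the engine itself, so the client simply negotiates
# TLS with the cache endpoint rather than with a proxy in front of it.
r = redis.Redis(
    host="my-cache.xxxxxx.use1.cache.amazonaws.com",  # placeholder endpoint
    port=6379,
    ssl=True,                   # in-transit encryption
    ssl_cert_reqs="required",   # verify the server certificate
)
r.set("hello", "world")
print(r.get("hello"))           # b'world'
```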
So that's what we ended up building for our managed service. We built TLS directly into our fork of Redis open source, and that's what we had to maintain going forward. We built this when Redis was only on Redis 3, and we maintained it all the way until Redis 6. Throughout that time, we launched a lot of features, and many of these features don't seem like they have anything to do with TLS, but they still had merge conflicts we had to deal with. Fun fact: a collective noun for a group of developers is in fact a merge conflict.
So we had stuff like, you know, Redis has a system called clustering, and the fundamental unit of partitioning in it is called a slot. We had a mechanism to move slots around. Obviously that involves the network, so we had to encrypt that traffic as well. So that's more code we had to write and maintain over time. But there was also a feature in Redis called dynamic command renaming, which was a very odd way to make sure you don't execute very dangerous commands like FLUSHALL and KEYS. That mechanism did not work correctly with how we built our TLS configs, and we had to resolve the merge conflicts every single time a new version came out.
We also had to do stuff like password rotations, online data migrations, which is migrating data into ElastiCache, and cross-region replication. Which, invariably, AWS always implemented the week after I finally wound up patching something together for a service that didn't have it because I needed it. And they're like, oh, that was great. Now take a big sip of coffee and check the AWS blog today to see what's new. Yeah, did not go well. But honestly, the best way to get a feature done was to have me do it, and then they would be right along presently to make my work look like the crap that it was. So all of these were features we built between 2018 and 2020, and every time there was a new version of Redis, we had to go and figure out how to resolve these conflicts.
And so this is what I like to call the vicious cycle of having an internal private fork. You know, we only have a fixed number of engineers, and if all those engineers spend their time handling merge conflicts, that means we have less time to spend upstreaming work. All we end up spending time on is building more and more internal features. So we build more features internally, which means we spend more time merging changes from upstream, which results in us not having enough time to upstream our own work, which means we build more and more features internally. This is true of a lot of services inside Amazon. A lot of times we want to contribute. This was also compounded by the fact that, at the time I joined, the company actually took a long time to get all the approvals needed to contribute anything to open source. For the very first bug I ever contributed, I had to talk to IP lawyers, I had to talk to our GM, and I also had to talk to our business line lawyer to figure it all out.
Breaking Free: A Simple Bug Fix Reveals the Power of Open Source Collaboration
So let's talk about a slightly different process that evolved and why I thought it was a lot better, because it was very important for me to contribute to open source. I want to start with a very simple bug. Redis supports functionality called replication: you have a primary, and it replicates data, eventually consistently, to a replica. We talked a little about TLS. We also had authentication on that pathway, and that was done through the AUTH command. So the replica would send the AUTH command to the primary with the password it was configured with.
I was one of the people who built this functionality, and while I was testing, I found out that most passwords worked. The product definition I was given said you can use anything in the password except @ symbols and null characters, which meant white space was allowed. But while we were doing aggressive testing on this functionality, we found out that if you put a space inside a Redis password, it didn't work, and the replication never succeeded. The most sensitive white space should really only be restricted to your company's board of directors.
It's also important to note that when you're testing things like passwords, you should have a variety of different inputs to test with. One of my personal favorites in my list of passwords to test when I'm doing authentication-based things is the EICAR string, a string that's designed to trigger virus detectors without actually passing live malware around. In theory, absolutely nothing should break. In practice, the fact that I mention this on a conference stage should tell you a lot about how many things tend to break. Yes, a lot of things do break.
One of my goals was that I didn't want to maintain this code internally, because it was actually on the same TLS path, the same connection establishment path. So I was like, hey, let's go fix this. I found that the code path we were using was one of two different protocols inside Redis. Redis has what's called the inline protocol, which is a bunch of space-delimited commands, as well as the RESP2 protocol, which is binary safe and supports things like spaces. So I was like, alright, let's just have special handling for the inline protocol so we send this AUTH command correctly. I submitted this pull request after going and talking with legal and getting approval, and I got a great response back from the maintainer of Redis at the time, Salvatore Sanfilippo, who said: no, let's move to the RESP2 protocol. It's much more robust. It's what we should be doing.
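To make the difference concrete, here is a small illustrative sketch (plain Python, not the engine's C code) of why the two protocols treat a password containing a space differently:

```python
def encode_resp2(*args: str) -> bytes:
    """RESP2 array of bulk strings: every argument is length-prefixed,
    so spaces inside an argument are perfectly safe."""
    out = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        out.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(out)

# Inline protocol: the server splits the line on whitespace, so this password
# arrives as two separate arguments and authentication fails.
inline = b"AUTH my secret\r\n"

# RESP2: the same password travels as a single length-prefixed bulk string.
print(encode_resp2("AUTH", "my secret"))
# b'*2\r\n$4\r\nAUTH\r\n$9\r\nmy secret\r\n'
```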
So this is a great open source success story. We found a bug, I suggested a fix, and I got input from someone who knew the system better than me that made it better. And why would you want to maintain that yourself? Well, we figured that treating the ability to have passwords with spaces in them as a competitive advantage is ridiculous. It's more of a burden on you to support and maintain the fork while everyone else keeps tripping over the same thing and doesn't know why. Exactly. And I think that gets into why AWS has increasingly been working to contribute more and more to open source over time. When we make open source better, that means more engineers will choose to run open source on AWS, which is just a great flywheel, right? It is simultaneously altruistic and self-serving. It's the right way to do it. A rising tide lifts all ships. Exactly.
Evolution of ElastiCache Architecture: From Custom Code to Community Contributions
Although at the time it took a lot of effort to contribute, we actually started contributing a lot more. I was given almost 50% of my time at Amazon ElastiCache to just go contribute fixes, and the first big thing I worked on was the TLS work I talked about before. For context, I was three years out of college at this point, writing C code. I was not a great developer, but I wrote a lot of code like this. That's my job. I'm the crap developer. Especially in C code, there are a lot of if statements like, hey, if TLS is enabled, do X; if not, do Y.
When I went to the Redis conference, the yearly conference for developers, to help evangelize that we really need encryption in transit inside Redis, that it's very important, the maintainer I talked about before kept saying: your code's not good, it's ugly, it doesn't work very well, we need a better solution. But I am nothing if not persistent, and I kept working with folks, and eventually we did find a solution. We ended up building this connection abstraction that hid away all of these low-level details, and that makes it sound very simple. It actually took a lot of effort to get a connection abstraction just the way we needed it, so that we were able to completely abstract away all the TLS work we had to do. And as a recently calibrated crap developer, one of those is a hell of a lot easier for me to understand, to see what it's doing and why and get to an answer, than the other one is. Readability is important.
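As a rough analogy for what that abstraction buys you (sketched in Python purely for illustration; the real change was a C-level connection interface inside the engine), callers get a single read/write surface and never branch on whether TLS is enabled:

```python
import socket
import ssl
from abc import ABC, abstractmethod

class Connection(ABC):
    """Callers write and read through this interface; whether the bytes go
    over plain TCP or TLS is hidden behind the implementation."""
    @abstractmethod
    def send(self, data: bytes) -> None: ...
    @abstractmethod
    def recv(self, n: int) -> bytes: ...

class PlainConnection(Connection):
    def __init__(self, host: str, port: int):
        self.sock = socket.create_connection((host, port))
    def send(self, data: bytes) -> None:
        self.sock.sendall(data)
    def recv(self, n: int) -> bytes:
        return self.sock.recv(n)

class TLSConnection(PlainConnection):
    def __init__(self, host: str, port: int):
        super().__init__(host, port)
        # Wrap the same socket in TLS; everything above stays unchanged.
        self.sock = ssl.create_default_context().wrap_socket(
            self.sock, server_hostname=host)
```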
Readability is really important, and this is a great success story. We, being AWS, did not end up actually contributing the code that was finally accepted. That was actually done by Redis Labs at the time, now Redis Limited. But this is still a great success because it's open source working as a collaboration and not just one group dumping code and being like this is how we want it to look, right? I was actually strongly opinionated about what the code should look like, and the community wanted something else, and so we all came together and built what we wanted as a community.
So I want to reiterate this because I think it's important. At Amazon ElastiCache, we only want to maintain code internally that helps us run our managed service. That leaves things like bug fixes, reliability improvements, and performance optimizations.
This is all stuff we want to freely contribute back to the community. We want to contribute upstream. We don't want to keep it internal; by upstreaming, we get expertise from the maintainers, we help the community grow, and we help make the system more reliable and sustainable. At the end of the day, we want to help maintain it in the community as well, because if a bug is introduced later, we'd much rather catch it inside the community than after we've taken the code and applied it back to our service.
I want to be clear that "we" on this slide is not just AWS. By now it should be wildly apparent that I do not work for AWS. The lawyers and the AWS sniper hidden in the rafters right now are very clear that I need to say that and make that clear. But "we" as the community, as a larger whole, the ecosystem which we're all operating in, this has emerged as a consensus decision. This is the model that we want to see. It is the open source virtuous flywheel.
Let's fast forward a little bit. I'm talking a lot about 2018 and 2019. Our service has evolved a little bit since then, but the core idea is still the same. In 2020, the previous maintainer of Redis stepped down, and I became a maintainer of Redis for a time, and that helped us contribute more to open source. So let's look at what our service looks like a little bit now.
We still have these customer VPCs. We've evolved from ENIs to VPC endpoints, the new trendy technology, so much faster to provision. We now use an NLB, and this NLB helps us balance the traffic that you send to our clusters through a proxy fleet, which helps us scale more quickly and more intelligently. You'll notice that instead of having a per-customer VPC, we have service VPCs. This helps us provision more quickly so we can more dynamically adapt to incoming scale. But at the end of the day, we're still running the exact same cache nodes I talked about earlier, which have the same management processes and have the same open source engines running on top of them. Although we have fancy cluster routing in our proxy fleets, we still at the end of the day deeply rely on this open source technology working well.
Serverless Scaling and the Rejected Cluster Slot Statistics PR
I would also like to point out that this talk is about ElastiCache, which is kind of a database, but this slide features Route 53, which is definitely a database. I want to talk about a problem similar to TLS, which is how we handle spiky workloads in serverless. Traditionally in caching, you might provision to handle a throughput like the one you're seeing here. This is sort of emulating a food delivery service: you'll see a small spike during the lunch hours and a bigger spike during dinner.
Traditional caching always wants to be able to serve data, so you have some amount of buffer on top of the high water mark. But the engineer in you should be unhappy with the fact that there's so much overprovisioning, though you sort of have to on some level. Historically, autoscaling was great. It was the capacity you need 20 minutes after you needed it when something hit unexpectedly. But you saw this pattern even in places with highly predictable workloads. People would say, well, we want to make sure we never run into a capacity problem, so they go for the absolute highest of high water marks and never scale things back down. Either you're smacking into performance limitations or you're overprovisioning and paying for the privilege as you go. It's an optimization opportunity, and this was the undifferentiated heavy lifting that we wanted to solve in ElastiCache, and that's why we built Serverless.
The way Serverless scales is it still has some buffer because we want to be able to handle the instantaneous traffic that will always happen throughout the day, but we try to stay much closer to the actual throughput requirements of your system. This requires us to constantly be predicting and then scaling to handle that load. So how are we able to do this? What we're actually doing is tracking those slots I mentioned earlier. The data inside Redis is broken down into what are called slots, and we're constantly tracking the throughput, the usage, and network bytes in and out of every single slot inside the cluster. When we detect one that's becoming especially hot, we'll basically try to isolate it by moving slots that are colder than it off that cache node.
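For anyone wondering what a slot actually is mechanically: it is just a hash bucket. Every key maps to one of 16,384 slots via a CRC16 of the key (with hash tags honored), which is what makes per-slot throughput tracking and slot migration possible. A small sketch of the standard mapping, which is cluster behavior rather than anything ElastiCache-specific:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 (XMODEM polynomial 0x1021), the checksum cluster slots use."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def key_hash_slot(key: bytes) -> int:
    """Map a key to one of 16,384 cluster slots, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]      # only the tag is hashed
    return crc16_xmodem(key) % 16384

print(key_hash_slot(b"order:12345"))       # some slot between 0 and 16383
print(key_hash_slot(b"{user42}:cart"))     # hashes only "user42"
```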
It's a little bit counterintuitive. We're not trying to move the hottest slots; we're trying to move other hot slots away from the hottest ones. The thing I really can't emphasize enough is that you used to have to plan for all of this. When I used to run a lot of cache clusters, the scale felt impressive at the time; today it's cute. This was a lot of work we had to do, and it felt like every time we talked to other large ElastiCache customers, they were doing the exact same thing. Eventually it was like, this really seems like the sort of thing we would like the provider to do for us. I happen to remember this fondly because it isn't like this anymore in most cases.
It is great. So how were we able to solve this? We built something we call Cluster Slot Statistics. Now, I want to highlight this PR specifically because of the date. ElastiCache launched Serverless at re:Invent 2023. We opened this PR a full year in advance. Amazon ElastiCache is, at the end of the day, still a service where we have to make money and build features for our customers, but we knew that in open source we needed a long lead time to build this functionality. We didn't want to build it first and then dump the code. We wanted to go and work with the community to build something that we all wanted together.
This is not a change that would solely benefit AWS. This makes a lot of sense for an awful lot of customers, large and small. We actually had great engagement from GCP at the time, who was also looking at something similar, and as you can see there are 111 comments on this PR. There was a very lively conversation about it, and we were very optimistic it would be merged. I put a lot of stake in this. I spent a lot of time. You can see I edited it because I wanted to make sure it communicated what we wanted to the community as effectively as possible.
But we didn't get this PR merged, and the answer was a little bit disappointing. Several months later, three months, four months, a long time later, we found out that Redis Limited simply didn't want to accept the patch and didn't give us any commitment as to when it could possibly be merged. Now, a lesser, more cynical person than I might suggest that this was because they were trying to protect a specific roadmap for a specific business model, but I'm sure they had other, better reasons for not implementing a change that would serve customers and other providers very well. I just don't know what that other rationale might be. Maybe you have ideas. Choose your own adventure, really.
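For what it's worth, the capability did eventually ship in the open as a CLUSTER SLOT-STATS subcommand in Valkey 8. Assuming that syntax (ORDERBY a metric, optionally with a LIMIT), asking a node for its busiest slots looks roughly like this; the endpoint is a placeholder, and redis-py has no dedicated helper, so the raw command is used:

```python
import redis

r = redis.Redis(host="my-valkey-cluster.example.internal", port=6379)

# Ten busiest slots by key count; each reply entry pairs a slot id with
# its per-slot statistics.
hot_slots = r.execute_command(
    "CLUSTER", "SLOT-STATS", "ORDERBY", "key-count", "LIMIT", 10
)
print(hot_slots)
```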
The Redis License Change Crisis and the Birth of Valkey
So at the end of the day, we ended up keeping this commit internally. We still had to deliver Serverless, and we ended up building it for Serverless. But what happened next wasn't really what we were expecting. So, Madelyn just showed you that AWS built something interesting, something with a great opportunity to improve observability, performance, and cost efficiency, and for some reason Redis wouldn't take it, so it stayed internal. That's not how AWS or any large provider wants to work with open source.
So what happened back then? It turns out that there was trouble brewing on the horizon. As an aside, I have absolutely no idea why Redis's branding now looks like a 1950s diner, but okay, I can hang. Before March of 2024, Redis was BSD licensed, which is truly permissive open source. You can more or less do whatever you want with it. You could use it commercially. You could run managed services out of it, whatever your little heart desired.
In March of 2024, Redis Limited changed to SSPL and RSALv2, which sounds like a cat falling on a keyboard, but what it means is that these are restrictive licenses that say if you run this as a managed service, you have to open source your entire infrastructure, or, presumably through a dual license approach, cut us a giant check. The Open Source Initiative, OSI, does not recognize the SSPL as an open source license. So this was a closing of the open source branch of Redis, and it pleased basically no one.
I can tell you from personal lived experience that it is extraordinarily hard to shake AWS down. There are better uses of everyone's time, so all it really achieved was upsetting a tremendous number of people, and it lets me make some cynical but very true observations like there's no rug on the stage because Redis kept trying to pull it away during the rehearsals. The problem with this rug pull is that they failed at it, which is somehow even worse, because if you're going to stab your community in the back, at least do it right.
But that's not what happened, because within weeks the community forked Redis version 7.2.4, which was the last BSD licensed version. The fork was called Valkey, under the BSD license, and this was not some sketchy thing that a random couple of salty people decided to do. The original Redis maintainers joined the fork, and others backed it: major companies like AWS obviously, but also Google, Oracle, and a whole bunch more.
The Linux Foundation provided vendor-neutral governance to make sure that this wouldn't happen again and that it stayed BSD licensed. The rug isn't going anywhere. Then in May of 2025, Redis backtracked, or tried to. They switched to the AGPL, which is still restrictive, but it is an open source license. It was too little, too late. The community had already moved on. Valkey had momentum. It had the governance, and it was already innovating faster than Redis was. You can't unring some bells. Some things cannot be easily undone.
Discovering What's Under the Hood: ElastiCache Runs Open Source Redis
Now, I have to admit something based upon what Madelyn mentioned at the start of this talk. I naively thought, until we started preparing for this, why would AWS give a toss about actual Redis licensing? Obviously they're not just taking the open source version of Redis, slapping it on some instances, and calling it good. That's how you or I would build a managed service; that's what we would do. They obviously have magic stuff instead, because running just the open source version, that would be insane. So she had her simple diagram of how it works under the hood, and this is the version that I had put together.
In my mind, it was obvious that things like ElastiCache were just API compatible with Redis: wrappers around very special things under the hood that had nothing real in common with it. Like, ElastiCache implements the Redis protocol, but beyond that it's all its own code, a totally different implementation. I now know that this was wrong, because that's not what they did. They run actual open source Redis with minimal changes, about 100 lines of code. ElastiCache literally runs the Redis binary, the open source code that was made available. That is what a managed service built on open source actually means.
They run the open source engine and manage the infrastructure around it, and that's why they give a toss. So when Redis changed their license, this wasn't a matter of, oh, AWS needs to make sure that their API stays compatible. This was AWS legally cannot use this code anymore under the new license terms. That's a very different problem. AWS had to act. So what did they do, Madelyn? They let me go and fork Valkey. Yay, and there was much rejoicing, at least in these corners here. You can tell that it's my slide again because it's not AI generated. Yes, it's not insane either. Yes, and this is more normal. Exactly.
So Corey gave a pretty high level understanding of Valkey, and I want to talk a little bit more about it. And I want to make it clear that this is not an AWS fork of Redis. This is not what happened with OpenSearch, where we took Elasticsearch, we took the last version that was open source, and we made open source first the distro and then the project itself. Valkey was created from day one inside the Linux Foundation, and it is not controlled in any way by AWS. There are six members of what we call the technical steering committee, including me and Zhao, the other maintainer from Redis that came with me, along with four other very active contributors to the Redis open source project, one from Ericsson, one from Huawei, one from Tencent, and one from GCP.
Yeah, and I want to just call this out because I think this is not necessarily well understood among some of the younger set, if I'm being honest, or folks that are new to the industry. Forking something on GitHub as part of a workflow where you're going to make something better, or in my case significantly worse, is not the same thing as a full-on fork of a project where you start migrating over governance and attention and it becomes the new authoritative source. I mean, when I click the fork button, no one really cares, except, you know, if I'm doing it to Linux or something, somewhere Linus Torvalds wakes up in a cold panic sweat like something terrible has just happened. What is it? Yeah, I make everything worse. It's like the reverse scatological Midas touch.
But yeah, normally when you click the fork button, it is meaningless to the parent project unless you take the entire community, effort, and attention with it. And that's what we tried very hard to do. We got over 50 organizations to come with us to help build the Valkey project. We talked about some of the core maintainers, but we also had support from companies like Percona, from Snap, from Verizon, from Oracle, all big, well-known companies that are interested in keeping the project open and under an open governance. Because Redis did more than just change the license. They kind of changed the community, and we weren't happy with that.
So since we actually created the fork, we have 150 unique code contributors. We have thousands of commits, and we're doing quite well. We have tens of millions of container pulls, which you know isn't the billions that Redis Open Source has.
But we're really happy that for those who care about the community, we're able to keep building for them. And of course, Amazon ElastiCache came along for the ride. Amazon ElastiCache for Valkey is still built on top of open source Valkey. You have to keep innovating. You can't build a database or a caching service and then get it to a stable point but then just never touch it again, unless you're SimpleDB. That's a service I've not heard of in a while.
Valkey Innovation: Multi-Threading, Memory Efficiency, and Vector Search Collaboration
So I want to talk a little bit about some of the innovation we built, as well as other major innovations built in the community, why they were built, and how they're improving the community and the ecosystem as a whole. Now, I want to be very clear. You in the audience have no idea, absolutely no idea, what it takes to pin AWS down on hard numbers in a public forum like this. If the slide says it and I'm not already on the stage bleeding out, it's true. You can take these numbers to the bank. They are real. That's why they almost never do numbers, because lies, damned lies, and vendor benchmarks. These are real. And for the numbers I'm about to talk about, we have two full blog posts on the Valkey website explaining how we measure this specifically and how you can reproduce it yourself. So if you have any doubts about what I'm saying, I'm happy to help you verify them.
Earlier on, around 2018, Amazon ElastiCache started working on horizontal and vertical throughput improvements. I mentioned before that Redis open source was primarily single-threaded. It did have this concept of IO threads, which helped it scale a little, but we had customers who were suddenly running into issues where they had huge spikes in workload and needed to handle that on a very specific key. Caching is often very lumpy. It's very uneven, and you might have a hot key that needs to handle hundreds of thousands, if not millions, of requests per second. So we wanted to build something to help with that, and we actually tried to open source this a couple of times, but the conversations kept stalling out because it was very difficult to build consensus.
But once Valkey came around, we took another stab at it with the new technical steering committee, and everyone was actually very excited about it. And I had the same feeling I did with TLS, which is that everyone wanted this to happen and we were willing to figure out how to make it happen. The way we implement it is basically that the main thread acts as a single coordinator and distributes work to other IO threads, which is a little different from some of the other designs at the time. If you're interested in that, I did another talk at this re:Invent going into detail about it.
But the main thing that we observed for it is it really helped our serverless offering. So ElastiCache built this thing. We needed it to help scale more quickly in serverless. So on Redis 7.2, it took about 70 minutes to get from zero requests per second to 5 million requests per second, the capacity you need 70 minutes after you need it. But it was scaling exponentially the whole time. But with Valkey 8 and this new multi-threaded IO I was talking about, we were able to get from zero to 5 million in about 15 minutes. And the actual test I ran, I got there in 12 minutes. So these numbers are based off a test that I ran, and I can show the benchmarks if anyone's interested.
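None of that scaling requires anything special from the client side; the multi-threaded IO is an engine-side change. If you want to watch throughput ramp yourself, a crude probe like the sketch below works (placeholder endpoint; a single client obviously will not reach millions of requests per second on its own):

```python
import time
import redis

r = redis.Redis(host="my-cache.example.internal", port=6379)  # placeholder

def measure_rps(seconds: float = 5.0) -> float:
    """Fire batches of GETs through a pipeline and count completions."""
    done, deadline = 0, time.time() + seconds
    while time.time() < deadline:
        pipe = r.pipeline(transaction=False)
        for i in range(1000):
            pipe.get(f"key:{i}")
        pipe.execute()
        done += 1000
    return done / seconds

print(f"approx {measure_rps():,.0f} requests/sec from this one client")
```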
So this is how we want open source to work. We have a need from our customers to scale quickly on specific shards, and we don't want to do all the work of maintaining it in our private fork. We want to work with the community so that everyone gets the benefit of it. But I'm talking about something that eight of us built. What else have other people built? Ericsson recently contributed what is basically a new memory-efficient hash table. This is something I've been harping on for years: Redis had a lot of overhead in how it stored individual key-value pairs. In the worst case, there was about 100 bytes of overhead to store a simple pair like "foo" and "bar" inside the Redis open source engine.
And many people use caching for very dynamic data, usually driven by databases, and a lot of that data is pretty small. So if you have 100 bytes of overhead per key, you're paying a lot more in DRAM cost to store it all. I also want to point out that this was one of the things that convinced me to be less cynical than I normally am about this stuff. If it had just been an AWS project, or AWS and some friends, then I would have been a lot more skeptical about the drama around the relicense and the fork. But Ericsson, which is many things, is not, as best I'm aware, a cloud service provider in the same way. They're not selling a hosted Redis option. So when they start contributing to things like this, that's how I knew it was real. And then I looked at who else is doing stuff like this, and it's too many logos to get clearance to put on a slide.
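You can eyeball that per-key overhead yourself with the MEMORY USAGE command; exact numbers vary by engine version and encoding, and the endpoint below is a placeholder:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # any Redis/Valkey endpoint

# MEMORY USAGE reports the total bytes attributed to a key, payload plus
# the engine's bookkeeping, so the gap hints at the per-key overhead.
r.set("foo", "bar")
total = r.memory_usage("foo")
payload = len("foo") + len("bar")
print(f"total={total}B payload={payload}B overhead~{total - payload}B")
```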
There's a lot of very big companies who care about this, and Ericsson is one of them. They have an engineer whose name is Victor, and his whole job was to make sure that Valkey was running efficiently and effectively inside some of the products produced by Ericsson. Most of this job means I'm just not allowed to touch it, and that's half the battle. Exactly, yeah.
As I said, within AWS this is also a big benefit for us. This is a graph we got from one of the large meal delivery services that runs on top of ElastiCache, and they were able to see a 41% total memory saving when they moved from Valkey 7.2 to Valkey 8.1. Now, I want to point out that when you say there was a 41% saving, the math on this is confusing because you have to multiply the individual savings rather than add them, and certain very pedantic people at AWS during the rehearsal wound up in a polite corporate version of a screaming match over the math here. If you want to question the math, please feel free to clear three hours from your schedule for round two. I don't want to be within 50 miles of that a second time.
There's the math, by the way. I don't think it's that hard. People seem to be overblowing this. Honestly, it was like watching two academics screaming at each other, only politely. It's like, don't you both have tenure? Can you please go do something else? It was great, it was great, fantastic.
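For anyone who wants to re-derive it, the only subtlety is that successive savings compound multiplicatively rather than additively. With made-up step sizes, purely to illustrate the arithmetic:

```python
step1 = 0.20   # hypothetical: one improvement saves 20%
step2 = 0.26   # hypothetical: a later one saves 26% of what remains
combined = 1 - (1 - step1) * (1 - step2)
print(f"combined saving: {combined:.1%}")   # 40.8%, not 20% + 26% = 46%
```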
So I want to talk about a third story, and in my opinion it's my favorite. Amazon also has another Redis open source compatible service called MemoryDB, another service we could probably do a better job of marketing. Inside that service, we launched a feature called Vector Similarity Search. It was built as an extension to Redis open source called a module, and it provides vector similarity search, used in things like semantic search. It's very trendy with agentic stuff, but I'll let him talk more about that if he wants to.
Yes, this one was also pushed heavily by Google Cloud. Now, many folks, including a lot of those who currently work at Google, have forgotten what their company does. But back in the mists of antiquity, before many of you were born, Google was a search company, and they were really good at it. So this is sort of getting back to their roots when they start pushing for this sort of thing. It's like, oh right, I remember those days. It feels almost like the AltaVista era.
So, as I mentioned before, Google Cloud is one of the TSC members, and they also had an implementation of vector similarity search, and their implementation was better than the one we had in MemoryDB. It had horizontal scalability. The one we had in MemoryDB only allowed a single shard, and as we talked about, people wanted terabytes of vector similarity data, not just 100 gigabytes. When we went and open sourced both of them and looked into which was the better one, we decided on Google's.
So in the open source community we all rallied around the better implementation, which is how we want to do it. It's a little bit hard for us at AWS to accept that we didn't build the best project, but very recently inside ElastiCache, when we decided to launch vector similarity search for our caching-based version of Valkey, we chose to use the Google implementation, not the one built on top of MemoryDB. There's a quote here where we say it's the fastest as of November 17th, which is AWS speak for maybe things have changed. Yeah, one guess whose $2.5 trillion company required that specific date on this slide if we wanted to put it up in front of you all. Go ahead and guess. It's ours. But it's still very exciting, because vector similarity search for in-memory databases is a very important workload for applications that need very high recall rates, and recall rate is basically the percentage of correct results you get back from an approximate vector similarity search.
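Since recall rate does a lot of work in vector search benchmarks, here is a tiny illustration with made-up document IDs: recall is simply the fraction of the true nearest neighbors that the approximate search actually returns.

```python
# Exact top-10 neighbors versus what an approximate index returned.
exact_top10 = {"doc3", "doc7", "doc9", "doc12", "doc15",
               "doc21", "doc22", "doc31", "doc40", "doc41"}
approx_top10 = {"doc3", "doc7", "doc9", "doc12", "doc15",
                "doc21", "doc22", "doc31", "doc40", "doc99"}

recall = len(exact_top10 & approx_top10) / len(exact_top10)
print(f"recall@10 = {recall:.0%}")   # 90%
```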
The Business Case: Price Cuts, Redis Pulling from Valkey, and the Open Source Victory
So don't worry. Despite how "let's talk about business value" sounds as an intro, this is not a lead-in to an AI sales pitch. Madelyn just showed you all of the technical wins: faster performance, better memory efficiency, improved reliability. These are all real and they are all measurable. But let me translate that into what matters to the business, which comes down to the price that you pay, because ElastiCache for Valkey is less money than the exact ElastiCache for Redis equivalent, and the two can be dropped in interchangeably. I built a thing we'll get to in a minute, and I did that repeatedly, and all that changed was the endpoint.
These numbers aren't just performance benchmarks that we're talking about here. These are actual AWS price cuts. Serverless is 33% less expensive than its exact Redis equivalent. Same service, different engine, one third cheaper.
Node-based deployments tell the same story, but the number is 20%, because it turns out there's still overhead to running actual servers, which is something the ancients used to do as well. You're getting the exact same presentation of the exact same service function. One of them just costs far less.
And here's where it gets delicious, if you'll indulge me in a little bit of trollish schadenfreude here. Remember that performance improvement that Madelyn showed earlier that AWS kept internal during the whole license drama and the rest? After Valkey forked and released it with that change in it, Redis pulled it in. You can look this up. It is PR 14039 on the official Redis project. The thing that they tried to take away, they are now taking it back from the fork.
Then there's multi-threading. Redis implemented their own version of multi-threading, and it was slower than Valkey's, so they had to go read Valkey's code and apply those learnings, exactly the same logic flow, to make theirs as good. It took them two PRs to get it right: first 13556, followed by 14017. And my personal favorite is that Redis announced in their Redis 8 launch blog post, "we made replication use 35% less memory," which sounds great. It's also great because it's Valkey's code, PR 13732, and it was pulled in almost verbatim. I'm not trying to be obnoxious here, but there are dozens of features where Redis is pulling from Valkey. The thing that they tried to lock down became their upstream. The community that they tried to monetize is now their innovation source.
Now I want to be very, very clear here on something. They are absolutely allowed to do this because this is open source. It's BSD licensed. Anyone can use it. I would argue they are not doing anything legally or even morally wrong by taking these things in, but the irony is spectacular because you tried to take control of a project by changing the license. Instead you lost control entirely and now you're downstream of the fork that left you behind. This is the business case for open source in its purest form.
When you have a truly open community that is driving innovation, everyone around you and everyone involved benefits. Madelyn's team built the memory improvement and contributed it back. Other companies joined. Features started flowing with a lot more rapidity. A lot more folks started working on this. Innovation accelerated, and I mean real innovation, not the AWS version they talk about in keynotes sometimes that's sort of shorthand for we finally released a feature you've been asking for for only six short years. Even Redis themselves benefits from this, as they should. That's what open source is about.
The rug has remained firmly bolted to the floor, because the license is open and will remain open, so no one can pull it. And your AWS bill, as a result of this, through no action other than switching over to the Valkey side, goes down. That's what happens when the community owns code instead of a company trying to shake down cloud providers. Now, I have built something on top of Valkey that will make AWS regret, more than they already do, giving me a microphone and access to basically everything. So go ahead and pull out your cell phone.
If you're interested, go and visit shitposting.ai, because of course you can't have a talk without AI in it. You can do it now and follow along. You've all been to these conferences, and you've all had to write those trip reports to your boss justifying why you just spent a week that felt like three years in Las Vegas. Now I have automated it for you. What I have built here is running on top of a whole bunch of different things, including the Strands SDK. It writes in my voice. It also gathers all the news that has come out from AWS; there's something like 130-some-odd announcements as of the time I stepped on stage, and probably more right now. You can sort them by category, by recency, or by stripping out the bullshit ones, which drops it down to something a lot more manageable. Then at the bottom you can give it additional context about what you care about, and it will go ahead and write the thing for you.
It'll take about a minute or so to spit out a result when all is said and done, because you know computers are a little slow, but this is the sort of thing you can build. I have been running it all week, pouring announcements into this thing, and last night, when I had to finalize the slide deck (because why do things in advance?), the bill for the month on the Valkey side was 54 cents. Now, yeah, with Redis it would have been a princely 80-some-odd cents, but the point is not the actual cost savings here. There's a performance story, it is less money, and switching back and forth again just required changing an endpoint. And of course, it's amusing.
So yeah, the site is shitposting.ai. Go use it, generate your trip reports, and remember that open source maintainers make everybody's lives better, including you going on vacation to places like this on your company's dime, because the best kind of money is someone else's. Thank you for listening to me. I am Corey Quinn. You are Madelyn Olson, and please remember to give us five stars in the app if you liked this presentation. If you didn't, please give us five stars and leave an insulting comment that tells us exactly what my problem is. We have ideas, we want to hear yours. Thank you. And if you do have anything, we'll be out there to take questions. Thank you.
; This article is entirely auto-generated using Amazon Bedrock.