Kazuya

AWS re:Invent 2025 - Maximizing EC2 Performance: A Hands-on Guide to Instance Optimization (CMP333)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Maximizing EC2 Performance: A Hands-on Guide to Instance Optimization (CMP333)

In this video, Toby Buckley and Jeff Blake demonstrate how to maximize EC2 performance using APerf, a system-wide performance analysis tool. They present two practical demos. A Groovy web application reaches a 3X performance improvement through JVM optimizations (disabling tiered compilation, adjusting the code cache size, enabling transparent huge pages), an instance upgrade from m7g to m8g, and refactoring away most of its aspect-oriented overhead. A MongoDB deployment increases throughput from 4,000 to 12,000 requests per second by switching from EBS to local NVMe storage on m7gd instances. The session emphasizes breadth-first performance engineering: examining system-wide metrics like CPU IPC, front-end stalls, branch mispredicts, and I/O wait before deep-diving into code optimization. The recurring lesson is that significant performance gains often come from infrastructure choices rather than algorithm changes.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Welcome to the Code Talk: Setting the Stage at Mandalay Bay

Thank you for making the trip all the way out to Mandalay Bay. I know this is one of the harder venues to get people into, so I appreciate you making the trip. Today we're going to talk about what's on the screen here: maximizing EC2 performance. This is a Code Talk, and this is a little bit of a different format than what I'm used to. I've only given one other Code Talk, and I was at a Toronto summit on ground level, so I'm not used to being up here on the stage.

I was hoping to make this a little more dynamic, but given that we're up here and not able to mingle, maybe it's going to be a little bit weird. I have stickers though. If you were answering questions or we were getting some engagement, I was going to give out stickers. I can't really hold them hostage, so if you do want a sticker, please come talk to me afterwards and I'll be happy to give you one.

My name's Toby Buckley. I am a Senior Specialist Solutions Architect on the EC2 team. I focus on EC2 performance and helping customers get the most out of EC2. I'm joined by Jeff Blake. Jeff is a Principal Engineer for Annapurna Labs at AWS, where he works on optimizing performance for Graviton instances, from user software, which we'll talk about today, all the way down into the hardware, which we'll also touch on. Jeff's a super smart guy, so I think we should have an interesting, jam-packed session for you today.

Thumbnail 0

Thumbnail 90

This is what we're going to talk about. We're really just talking about performance engineering from a high-level perspective. I know we have one HFT person in the audience, so we're not necessarily going to get into the high-frequency trading low-level performance stuff. This is more high-level stuff, although some of the tools we have may bring to bear some good signals for you. APerf is that tool. We'll introduce APerf if you're not familiar with it. It's a great tool for understanding the performance of your system.

We've got a couple of examples. We have a Groovy demo. Any Groovy users, or Java and Groovy users, out here? I'm not a Groovy expert either, but I put this demo together to be not so contrived that it seems silly, while still mimicking the natural, organic growth of an application at a company. It starts off pretty innocent, then it starts getting big, then it gets slow, and now you don't understand why. We also have one on MongoDB: how you can get more performance out of MongoDB, and how the tools can surface the signals so that you can understand why. Then, like any good talk, we're going to send you off with a call to action.

Thumbnail 170

I'm going to turn it over to Jeff. He's going to run some of the slides and I'm going to do the coding. As I said before, we don't have a whole lot of code for you to follow through with, so none of this stuff is in a repo necessarily, but I'd be happy to connect with you offline. We could maybe offer up whatever we can and potentially get some of this in a repo if that's of interest to people. Hit me up afterwards, outside or after the talk, and we'll talk about what logistics look like moving forward.

Performance Engineering Fundamentals: Going Wide Before Going Deep

So we're going to introduce performance engineering for those that may not do this day in and day out like myself and some of my team. Show of hands, who does performance engineering as their primary role? One person. So you might know some of this already, but for the rest of you, we're going to give a quick primer on performance engineering, so that when we go down into the code and figure out what to take advantage of to make things go faster, you'll have a bit of a base to rely on.

Thumbnail 240

Thumbnail 250

Performance engineering, very simply, is finding opportunity in your system, whether that's for efficiency gains, price efficiency, or performance efficiency, which I like the most; price-performance is also something we want to find opportunities for. While that sounds simple, it has some big challenges. One of them is the abstractions you rely on to build your software. Everyone uses things like sockets and web frameworks so they can concentrate on their business logic. When you're looking for performance opportunities, those abstractions can leak. When they're not performing the way you want, you have to start looking underneath the abstractions you rely on to build your logic; you have to look under the covers.

Case in point: we were helping a customer optimize some code they were trying to get performing well on Graviton. I'm a Graviton engineer, so I'm going to talk a lot about Graviton. They had an abstraction they built their software on, and we had to go into that abstraction, one they had no knowledge of, and show them that the optimization wasn't in their code. It was in this abstraction, and we had to do some away-team programming to make it faster. Again, abstractions leak. Everything has a cost.

Now you have to start thinking about what you've built, what you're building on, and what those costs are to understand how you can find more opportunities. You don't just have a VM all by itself; it may have different storage and different networking characteristics. You have to know what those costs are, not just in price, but also in the performance you can get, or the performance you're not getting.

Another part with performance engineering is that bottlenecks may hide others. You can have a bottleneck that you're trying to remove to get your performance up, and then when you finally fix it, you find you've uncovered a worse problem that was just hiding behind it. We had something like this where we fixed the performance of a network application and we uncovered a synchronization problem that actually made the performance worse by 15x. We had to dig deeper into the abstractions we were relying on to find that second bottleneck before we actually got to the performance we expected.

These are things that can finally lead to what I call a search space explosion. There are lots of things to think about. It's a very rich environment to go and learn things, be curious about what it is you're building, and what you can do to make things faster. You start understanding the different layers and the different costs of everything you're working with.

Thumbnail 390

But let's talk about how we're actually going to do performance engineering. Ideally, the model is: you define what you're going to measure, you measure it, you understand it, you tune it, you get some performance or efficiency, you go around this loop a few times, and you finally come out the other side and say you're done. Performance engineering, for those who are wondering, is never done. It's just something you stop. You say my return on investment is good enough, I'm going to stop. I'm not going to look for that extra 0.5% or extra 0.1%. It's not worth it.

Thumbnail 420

But the reality is going to be a little different. With performance engineering, there are lots of tools out there. These are just a handful. You can use things like strace, iostat, eBPF, and asprof. These are all tools that can help you go very deep into parts of your program. This becomes a problem because now you're taking meandering paths. It's no longer this nice concise loop or straight line path. You may have to look really deep, take a couple of false turns, and take a couple of dead ends before you find your tuning opportunities and then come back to remeasure what you've done and see if it's helped.

Thumbnail 460

This turned out to be somewhat of a problem even on my team, where we tend to rely on intuition to know which tools to use. So we took a step back to figure out what we want to share with you from AWS, because we were working these performance engineering problems and not getting as far, or as fast, as we wanted. We found that our intuition is actually something that gets in our way. Intuition can be misleading. Doing these depth-first searches can be very inefficient, especially if you pick the wrong path first and only later back out and say that didn't do anything, let's go somewhere else.

It's simply this: let's try going wide before we go deep. That's just what it sounds like—breadth-first search. Prioritize your opportunities by looking at the full system first. When I say full system, I mean things that you may not even think about when it first comes to finding performance. Some people will say it's my code, I should know what to go fix in my code, but maybe you need to look at something more system-wide that's not even related to your code. That's where we found that sometimes big gains hide in plain sight.

We had another customer that said we need to optimize the compression library. It's absolutely the compression library. That's where our intuition was telling us to go, and we took a step back and said let's look at the whole system first. It turned out compression wasn't even the bottleneck. It was just that the machine was running out of memory and then swapping to disk. We fixed that opportunity and then the performance lifted up by 50% and we were done.

Thumbnail 540

Thumbnail 560

Introducing APerf: A Wide and Deep Performance Analysis Tool

Putting that into graph form: we define our measurement, look at all of our signals, get the system-wide understanding, get our tuning candidates, and only then go deep on the ones that look the most profitable. It's all well and good to talk about it, but we want to share a tool we've developed that takes away the need to remember 15 different tools just to get that wide view before you go deep. We started developing this tool, and it's now on GitHub at version 1.0 as of last week. It's called APerf, and it's a wide-and-deep tool. It's not meant just for deep code performance tuning; it's meant to look at everything very, very wide before we go deep. Toby can share a couple of anecdotes about how he's used it in his role.

As a Senior Specialist Solutions Architect, as I say, oftentimes customers call us: we're in an account and we have some problem we have to deal with. The beautiful thing about APerf is... actually, how many people have heard of APerf before today? No, okay, one. Okay, the performance engineer has. Great, so that's good. But the beautiful thing about it is that it lets us solve problems that we can't tackle alone. If I have a problem that I need to bring to Jeff and his team, we need some kind of common language and common data to do that. APerf is that tool. It lets us ask the customer, if they're having a particular problem, to go deploy it, record some samples from a run, let us know what's happening, and send it to us. It packages everything up in a nice tarball with all the reports in it, and we can then visualize it or send it downstream if we're stumped. So it's a nice thing for creating a shared reality for all parties.

Thumbnail 650

Before we go any further, I want to say it's not the only tool. It's a tool for your toolbox, a good wide-and-deep tool. As I said, there are plenty of other tools out there that go much deeper, and I could talk forever about some of them if you catch me after. But APerf, as I said, measures hundreds of system-wide statistics. These range from very high-level metrics like CPU utilization and memory utilization all the way down to deep, low-level metrics that tell you how the CPU itself is performing. All of these signals can be taken together to find those opportunities.

It's meant to be simple to use. It's a self-contained binary. You put it on the system under test. It's a point tool. You don't have to onboard a huge service or infrastructure to get it going. You can use a test box, put it on there, take a recording, pull the recording off, and then it generates a static set of web pages which will show, as we get into the code portion of this talk, the results. It's very low overhead. We've measured it as less than five percent of one CPU when everything is turned on, which we find to be an acceptable trade-off to get the hundreds of statistics we want to measure.
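For reference, the record-then-report workflow described here looks roughly like the sketch below. The run names are made up, and the flag spellings follow the APerf README at the time of writing, so treat `aperf --help` for your version as authoritative.

```bash
# Record system-wide data on the box under test: sample every 3 seconds for 180 seconds.
./aperf record -r groovy-baseline -i 3 -p 180

# Turn the recording into a static HTML report you can open locally.
./aperf report -r groovy-baseline -n groovy-baseline-report

# Or put two runs side by side for the A/B comparison view shown later in the talk.
./aperf report -r groovy-baseline -r groovy-optimized -n groovy-compare
```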

Thumbnail 740

Regarding deployment, you can just SCP it onto your VM or onto your bare metal lab machine. It also has the ability to be packaged up in a container and put into something like a Kubernetes pod to measure a Kubernetes node in a privileged container, which we know is a fairly common use case for deploying web services. That's a new capability we've been working on in the last month or so, and we'll show that here today as well. Okay, I'm going to hand it back to Toby. I need to get on my laptop to set things up for the demos, and he'll explain what we're going to talk about first.

The Groovy Demo Setup: Aspect-Oriented Programming and Service Level Objectives

Cool, thank you. So as we mentioned, Groovy is going to be the first demo, not because we love Groovy or anything else, but because it has some of the stuff we're talking about. Anybody heard of aspect-oriented programming? Okay, AOP. It's been around forever. It's basically the idea that you have some cross-cutting concern you want to apply to a bunch of different methods. Maybe we want to do logging or whatever the case may be. You could define that as an aspect and apply that aspect in multiple places. It's beautiful for maintainability, testability, and readability. But not without cost. Just like everything in engineering, there is always a trade-off. So really, one of the goals of this talk is to help you understand what those costs are so you know where to make those trade-offs.
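To make that concrete, here is a minimal sketch, not the demo's actual code, of what such a cross-cutting aspect might look like in Groovy with Spring AOP; the class name and the controller package it targets are hypothetical.

```groovy
import org.aspectj.lang.ProceedingJoinPoint
import org.aspectj.lang.annotation.Around
import org.aspectj.lang.annotation.Aspect
import org.springframework.stereotype.Component

@Aspect
@Component
class LoggingAspect {

    // Wrap every public method in the (hypothetical) controller package.
    @Around("execution(public * com.example.demo.controller..*(..))")
    Object logTiming(ProceedingJoinPoint joinPoint) throws Throwable {
        long start = System.nanoTime()
        try {
            return joinPoint.proceed()   // run the actual endpoint logic
        } finally {
            long micros = (System.nanoTime() - start).intdiv(1000)
            println "${joinPoint.signature.toShortString()} took ${micros}us"
        }
    }
}
```

Every call that matches the pointcut now runs through proxy and reflection machinery in addition to your business logic, which is exactly the kind of hidden cost, and the deep stack frames, that show up later in the flame graphs.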

Thumbnail 800

The example is not terribly contrived. It's fairly full-featured, we think. Here's the setup and topology. It's a simple cluster with three node groups: one dedicated to our load generator, where we're using wrk2, a nice load generator that gives you tail latency numbers, requests per second, and all that. An m7g node group is used in a couple of flavors: one runs the unoptimized version, and then we'll restart the pod with some optimizations and see what effect they have. Then we have an m8g, which is also a potential option if you're trying to get performance out of a system; you can always make the hardware bigger or faster by throwing more hardware at it. We'll walk through all those scenarios and do a price-performance analysis, which we'll give you verbally, with some numbers at the end.
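As a rough illustration of the kind of load-generation command involved (not the exact invocation from the demo; the URL, thread counts, and target rate are placeholders), wrk2 keeps the `wrk` binary name and adds a constant-throughput mode plus latency reporting:

```bash
# Hold a constant 4,000 requests/second for 2 minutes and print the latency distribution.
wrk -t4 -c128 -d120s -R4000 --latency http://groovy-service:8080/api/process
```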

Thumbnail 860

Thumbnail 890

Thumbnail 900

With that, I'll let Jeff switch over, and we'll get to the demo. Okay, so I'm going to be showing everything in my VS Code window. Can everyone in the back row read this, or do I need to go a little bigger? Back row, everybody good? Okay. Let me know if I need to go one size bigger; it's already a little cramped. I'm just going to show this really quick. This is the container we're building our Groovy app into. It's based on Amazon Linux 2023. We put Corretto 21 on there for our Java runtime, and then we build our Groovy app with Gradle. We transfer it onto a container that uses Tomcat as the web server, and we use some very basic Catalina options for the JVM.

Thumbnail 920

Specifically, 8 gigabytes of heap and G1 GC, which is very basic stuff. Now we're going to take a look at the Groovy code itself. What we did here is we started building up a very simple web application. As Toby said, it's not completely contrived, but it's meant to show examples rather than be a full-fledged enterprise service. However, we've seen this help with other customers running full-fledged services, so it's a relatively good example.

Thumbnail 940

Thumbnail 950

Thumbnail 960

What you might do in a JVM type language like Groovy is start defining endpoints like the hello endpoint and the process endpoint with some arguments to it. As we go along, we're going to say we want to do some things in our endpoints, but we primarily want to have separation of concerns. We want our business service logic in its own package. We want to call that from our endpoint. Our endpoint just handles HTTP requests coming in and responding to them.

Thumbnail 980

As you build up an application, we can see we have other endpoints like a message endpoint, a setup endpoint for authorizing users, a health check endpoint because we're in Kubernetes, and also we want to be able to pull metrics. Another health check in the Kubernetes cluster wants to pull some metrics to keep an eye on the health of the deployment. As we go, we might say we really want to take a look at all the people accessing our system, so we want to log every client IP to get security auditing going. We want to check that everything is authorized.

Thumbnail 1010

Thumbnail 1020

Thumbnail 1030

The way we're applying these aspects is that the aspects live in the aspects directory. All the code for doing authorization and other concerns is in here. We can apply them either as annotations, or, if we want something like metrics to apply to everything, we can do that with Spring Boot pointcut expressions. We're really leveraging aspect-oriented programming to keep all our concerns separated. Cross-cutting code can be applied to all the endpoints we want, whether explicitly with annotations or more broadly with pointcuts. This builds up a nice web application with very clean code.

The thing we will start noticing is that we want to keep performance within an SLA, or more specifically, an SLO, which stands for service level objective. Everybody familiar with SLOs and service level objectives? This is something I hear customers say they don't do a good job of conveying, or it's just not out in the wild for consumption. Think of an SLO as what you need for your business to operate. It's kind of like an SLA, but it's more about the business perspective versus the end-user perspective. What do you need to run from a business perspective, and how quickly? Maybe a P99 of 100 milliseconds at a throughput of 5,000 requests per second, or something like that. You pick that, and it sets the anchor for your performance when you're making changes. You know whether you're staying within that SLO or drifting away from it. If you're outside your SLO, you're breaching your own contract.

That's really what underpins a lot of the work we're doing here. The SLO should ideally be something you set at business time, not just a comparison to something you've done before. In this case, our SLO is to stay under a P99 of 100 milliseconds. We found the breaking latency: at 4,000 requests per second from our workload generator, we were at about 50 milliseconds, but if we pushed past 4,000 RPS, we went well past 100 milliseconds.

Thumbnail 1150

You've probably seen that curve before, right? The breaking-latency curve. It's going up: I'm at 2,000, I'm at 3,000, I'm at 4,000, I'm at 5,000. Throughput starts to plateau while latency climbs, and that knee of the curve is the breaking latency. We're sitting right at it. So, for the sake of this code talk, we're going to say this isn't fast enough. We want to go faster. We want to get above 5,000, maybe up to 10,000, more than double the performance. How will we go about doing that?

Thumbnail 1180

Thumbnail 1200

The first thing you might say is we can maybe optimize the code, but if you look through this example for a code talk, there's not a whole lot here. There's not a lot of logic. The metrics aspect is inserting something into a concurrent hash map. Our business logic isn't terribly complex right now, but we're still not able to push all that much throughput for 100 milliseconds at P99. So we've already gathered an APerf report that we'll go into next. As we're looking at that APerf report, I'm going to kick off an optimized run. We'll come back to it in about three or four minutes.

Thumbnail 1220

We're going to patch our pod and add a bunch of different Java optimizations. We'll discuss these optimizations in turn, explaining why we're doing them and why we would use signals from APerf to guide our decisions. We're doing a lot of this through a script to make our lives easier. If anybody's interested in what that script is doing, please come up afterwards and we're happy to show you. It basically deploys the pod, starts the run, records and stops, and makes sure that everything has good timing.

Thumbnail 1240

Reading the APerf Report: CPU Utilization, Flame Graphs, and Performance Signals

We've already gathered the APerf report for the base program. This is on an m7g.xlarge with four CPUs. When you open an APerf report from the index HTML, it displays this type of website and web page. Before we continue, I apologize for the interruption. How many of you have heard of or are using Graviton? That's a show of hands for who's heard of it and knows what it is. Probably about sixty to seventy percent of the room. How many are using it? Great. For those who aren't, I'd like to talk to you afterwards and find out why.

Everything we discuss here will mostly apply to x86-based instances as well. Many of the things we talk about for Graviton aren't unique to Graviton, but we're using Graviton because I know a lot about how to make it run faster, so it seemed like a logical choice. APerf, as you can see here, has a homepage that tells you some very basic statistics about the recording you did: checking that your AMI ID is what you thought it was, that your instance type is what you thought it was, and that the kernel version is the one you expected to measure. These seem very simple, but I've personally run into cases where people told us they ran a comparison on a 24xlarge system against an 8xlarge, and the 8xlarge was slower. Well, that seems expected. But these types of things pop up, and sometimes it's good to have that right in your face as the first thing.

Thumbnail 1360

Along the left-hand side, we have a bunch of different statistics. As I said, we collect hundreds of them and group them into logical units like CPU utilization. We can click on CPU utilization and get aggregate CPU utilization, with time along the x-axis and utilization along the y-axis. We can see we're pushing around sixty to seventy-five percent CPU utilization for this application. We're really riding the top edge of what we can get out of this pod, because the pod is only allowed to use three CPUs and it's a four-CPU node. Nothing here looks out of the ordinary, and this is where the wide part comes in.

Thumbnail 1390

We want to start looking at different things. Maybe we check our memory utilization to see if it's going up and down or doing something odd. We see that we're staying pretty constant and our heap is doing what we expect. There's a lot of other things in here. I encourage everyone to go out and take it for a quick spin and see if it's measuring things that you might not already be measuring or thinking about. We get things like more detailed virtual memory stats, interrupts, disk stats, kernel config, and sysctl config where you can go and fine-tune things like the TCP networking stack or your scheduler if it's not behaving like you expect.

Thumbnail 1430

Thumbnail 1440

Thumbnail 1450

Thumbnail 1460

We also get code profiles like flame graphs, standard system flame graphs from perf. We're running a Java application, so the perf view doesn't look like much here, but we also have the ability to interface with async-profiler, so we can do CPU profiling at the Java level. Have folks seen flame graphs before? For those who don't know, a flame graph shows you a population of function calls along the x-axis and stack depth along the y-axis. A wider bar means that function showed up in a larger share of the samples, not that it ran earlier or later; the left-to-right ordering is alphabetical and carries no time information.

Thumbnail 1480

Thumbnail 1500

It gives you the proportion of time spent in each function, not the order in which things ran. Order only means something as you go up and down the stack depth; going left to right, there's no information, so you only look at the widths of the bars. Async-profiler is used here to gather these plots. There's a little legend for those who are curious: bright green is JIT-compiled, meaning the Java JIT has compiled it into native code; light green is inlined; dark blue is what you're matching against.

Thumbnail 1510

Thumbnail 1520

Thumbnail 1530

Thumbnail 1550

The first thing to notice in this flame graph is that, if I want to optimize my AOP code and make it faster, the stack trace is full of functions I don't remember writing. I don't remember writing an internal doFilter or an invokeExact_MT. These are all the things that implement the aspects, the aspect-oriented machinery in this JVM language we wrote this in. It seemingly goes on forever; it's over 100 stack frames deep. If I search with the magnifying glass for AOP-heavy code, I find some, but only a handful of frames are things that we actually wrote. So there's really not much we can do if we just want to go in and hack on our code. We probably need to look at another signal, another opportunity.

Thumbnail 1570

Thumbnail 1590

Thumbnail 1600

Understanding CPU Microarchitecture: Front-End and Back-End Performance Bottlenecks

One of those opportunities might be to look at why the CPU is not performing as fast as we think it should be. Before I go deeper into the PMU view of things, let me ask how many people here are microarchitects that have looked at how a CPU works. One person. Okay, I'm going to talk to you after this. I work at, we used to be on the same team. Oh, Julio, okay, now I recognize the voice. Definitely talk with you. No, it's the lights in my eyes, man. Okay, all right, let's talk about how a CPU works because I'm going to go into quite a few CPU metrics.

For those that don't know, a CPU is literally the simplest loop state machine you've ever heard of: it gets instructions, does some math, and goes back to step one. This is the abstraction we all depend on, and it goes around that loop billions of times a second on any modern CPU. The Graviton CPU runs at almost 3 gigahertz, 3 billion cycles per second. Some high-end desktops can get to almost 6 billion cycles a second. But it's really just getting instructions and executing them, just doing some math.

Thumbnail 1670

To go a little bit further and look at what APerf can tell you, we want to look at a little bit more detail. A modern CPU, because it's trying to get as much performance as possible, has two halves. We'll talk about a front half and a back half. The front half gets instructions from memory and feeds it into a queue that the back half executes. This is almost like a microservice where you have separations of concerns again, where one half is doing some work, pushing it to a queue, and the back half is doing some more.

But the front half can't wait for the back half to tell it if it's going down the right path in your loops or in your conditionals, so it's constantly predicting where to go next from the previous history. If your loop is doing 1000 iterations, the front half will try to predict like I saw this loop previously, I'm going to do 1000 iterations of this loop and feed it to the back end, and hopefully that's correct. The back end will then tell it whether or not those were correct. Every time you're wrong or the front end got the wrong instructions, you basically have to flush the entire thing and start over, and that's a very painful thing to have in your code. You want these two halves to operate at full speed at all times, and then you'll get the maximum performance.

Thumbnail 1720

So let's go back to the demo. The very first thing we're going to look at is the throughput of our CPU, and this is the IPC metric. APerf now has annotations, so you can click on things to get some help. Instructions per cycle is just how fast the CPU is processing work, and you really want something greater than one. If we look at the average here, it's actually under one. So we're spending a lot of cycles doing not a lot of work. We want to drive this as high as possible for most of our code. Modern CPUs can push anywhere from 8 to 12 instructions per cycle through their pipeline if you get everything aligned. It's very hard to do, but CPUs have that capability. They're really fast.

Thumbnail 1770

Thumbnail 1790

Thumbnail 1810

So we want to go a little bit further. APerf presents our PMU stats in a way we can understand, just like in that slide. We can look at the front-end stuff first, and the front-end metric here is stalls per 1,000 cycles. Anything above zero means things are stalling and not doing work, and we're at almost 60 percent of the time, or 600 cycles out of 1,000, where the front end can't feed anything into the back end. So we should probably look here first, but since we said let's go wide before deep, let's check whether the back end is actually worse. It's not. The back end is processing things pretty fast; it's only stalling about 200 cycles out of 1,000. So we can put that to the side and say our opportunity is really on the front end. What can we look at there? One thing is branch misses, which represent how many times we incorrectly predict the CPU's future behavior based on previous patterns.

Thumbnail 1830

Thumbnail 1850

This is another metric we want to drive as close to zero as possible. A rate of 10 per 1000 instructions is not good because it means the CPU is stalling and flushing things quite often. Similarly, we need to examine whether the instruction memory we're fetching from is being well utilized. There are caches along the CPU microarchitecture before you reach main memory, and you want to place all your instructions in the closest memory possible so they're fast and easy to fetch. In this case, we're not doing a very good job. We're missing out of that cache 60 times out of every 1000 instructions. We can see the same issue when translating virtual addresses to physical addresses, which is also fairly high at 4. All of these statistics should be driven as close to zero as possible.
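If you want to sanity-check these same signals outside of APerf, the raw counts come from the PMU and can be read with plain `perf`. This is a hedged sketch: the generic event aliases below exist in perf, but not every CPU exposes all of them, the target PID lookup is illustrative, and APerf may label the derived metrics slightly differently.

```bash
# Count raw events for 30 seconds against a running JVM (PID lookup is illustrative).
# perf prints "insn per cycle" (IPC) automatically when instructions and cycles are both counted.
perf stat -e instructions,cycles,branch-misses,L1-icache-load-misses,iTLB-load-misses \
    -p "$(pgrep -f tomcat | head -n1)" -- sleep 30

# The per-1,000-instruction rates discussed above are then just:
#   branch MPKI = branch-misses          / instructions * 1000
#   L1i MPKI    = L1-icache-load-misses  / instructions * 1000
#   iTLB MPKI   = iTLB-load-misses       / instructions * 1000
```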

Thumbnail 1870

I want to go back to those JVM options we put in and discuss the actual results from that script. The things we did were focused on the front end. To make a front-end-bound workload happier, you want to squeeze the instructions as close together as possible. That helps you keep instructions in the closer, faster memories and helps the branch predictor track the history, because the predictor is also a cache: if things are spread out too far, it starts missing. You also want to put things in contiguous parts of memory to take pressure off the components that translate addresses. These are all options we put into our Java application.

We said we want to turn off tiered compilation, which is a way for the JVM to achieve faster startup times by compiling methods twice. It first compiles in very simple assembly code, and then after it sees the method run a couple of times, it does the full optimization. This actually keeps two copies around, which takes up space and pollutes your caches. However, we can turn this off if we're not worried about startup time. Some people might be concerned about startup, so you may want to keep this on. It's something to experiment with.

The same applies to reserved code cache size and initial code cache size. The default is 256 megabytes, and the JVM in this case is not very smart about where it places methods; it simply finds a gap and puts the method there. However, if you constrict the space it can use, it will actually start packing methods better for you. Finally, we enable UseTransparentHugePages, which gets the JVM to put things in large, contiguous memory regions so you need fewer entries to translate between virtual and physical addresses. All of these were just command-line options. We didn't have to change any code; we only had to redeploy our pod.
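Pulling those together, the pod's JVM options might end up looking something like the sketch below. The heap and G1 settings echo the baseline Dockerfile shown earlier; the 128m code cache value is purely illustrative and should be sized to the JIT-compiled code your app actually produces; and UseTransparentHugePages only helps if THP is enabled (at least in madvise mode) on the host kernel.

```bash
export CATALINA_OPTS="-Xms8g -Xmx8g -XX:+UseG1GC \
  -XX:-TieredCompilation \
  -XX:ReservedCodeCacheSize=128m -XX:InitialCodeCacheSize=128m \
  -XX:+UseTransparentHugePages"
```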

Thumbnail 1980

Thumbnail 1990

Optimization Results: JVM Tuning, Hardware Upgrades, and Code Refactoring Trade-offs

The final results show that we got almost 20 percent more throughput at 4750 requests per second. Our P99 latency is still under 100 milliseconds. That's actually pretty good and represents a nearly 20 percent return on investment for relatively cheap tunings. We can look at the report to see if any of the things I just talked about making the front end better actually worked.

Thumbnail 2010

Thumbnail 2020

Just to point out quickly, we're talking a lot about Java and Groovy right now and the tunings you could do in a JVM. There are corollary tunings for basically all languages and all platforms. We do have some of this stuff documented, and we'll touch on that in a second. I don't want to go too far down the Groovy path and lose people, but I want to let you know that APerf is going to be the thing that surfaces those signals. If you do get those signals, the question is how do you fix them. That's what we're really trying to teach you here. We're trying to focus on using the data, all of the data, to guide where your best return on investment and optimization can be.

Here's a comparison report. This is another feature that APerf has that we find very useful. You can put two reports side by side to do a compare and contrast, an A-B comparison. This is super interesting. What happens if we have two different microarchitectures in our environment? If we have, say, an M7G on the left and M6I on the right, the report does the exact same thing. It puts them side by side, and as much as possible, the metrics that we collect will be named the same and be comparable between each other. The PMU events will be named the same and map to the same basic ideas. Cache misses are still cache misses. Instructions per cycle is still instructions per cycle. You can look at them and compare one to one. That's pretty huge if you've ever done any kind of performance work, understanding the different nuances between, say, Intel and AMD.

Thumbnail 2120

Thumbnail 2160

Now let's jump straight to the PMU events and show some other features. In the original report, we had a time-series visualization, and unless you want to squint hard, it's difficult to tell whether one line is actually 10% higher than another. We've now added a summarization on comparison reports, so we can see the details more clearly. Instructions per cycle improved by 12% on average, so we delivered on what we said we could do; we improved the CPU's performance by about 13% even though the throughput score went up by 17%. Front-end stalls went down by about 6%. That may not seem significant, but every front-end stall you can remove matters; they're far more expensive than anything on the back end, because the front end works pretty much in order. It has to fetch one set of instructions before it can get the next from memory, while the back end can execute things way out of order and in parallel.

Thumbnail 2180

Thumbnail 2200

Thumbnail 2210

Thumbnail 2230

If we look at back-end stalls, there's something interesting to notice: when you remove one bottleneck, you start pushing on another. If you removed all the front-end stalls, you'd see this blow up into a huge difference in the negative, in the red, because you've moved the bottleneck from one part of the CPU to another. As we discussed, this can happen in code as well. Branch mispredicts went down by 20%, which is good. We're doing exactly what we said we would, and all the things we expected happened because we constricted the code cache and reduced the number of times we recompile methods. Instructions got packed tighter, so we miss less in the L1 instruction cache. And because we put things in contiguous memory, we got a big drop in the number of times we miss in the cache that translates virtual addresses to physical addresses. The core doesn't have to fault and walk the page table every time; it caches those translations, and we can see those misses went down by 60%, which is a big decrease.

Thumbnail 2250

Thumbnail 2270

But if 20% is not enough, what can we do then? We can continue down this path of getting the hardware to execute our code faster. We can go to m8g. We've already run this in the background. m8g is getting 7000 requests per second for the same code with the same optimizations, and we're actually running at a little bit better P99 latency. But again, if we push past 7000, it blows up pretty quick, so we're already at our breaking latency here. We got 60 to 70% more performance with 10% higher cost. That's a 60% price-performance benefit from going from m7g to m8g, which is actually a pretty big win. So far we haven't actually touched any code.

Thumbnail 2300

Thumbnail 2310

Thumbnail 2330

Thumbnail 2350

If we open that report and go through it really quickly, this compares the optimized code path we just looked at against the same optimized code path on m8g. Looking at IPC, we get 38% more instructions per cycle. Just to jump in here: this is a function of the processor getting better. The processor got more efficient and is able to give us these gains. You see that generation over generation, and not only with Graviton; basically every processor vendor is endeavoring to do this kind of thing so you get those free upgrades. Graviton is no different. Every generation we've been shooting for 20 to 25%, and Intel and AMD try to do the same. We've got plenty of different levers to make things faster, and you can see even in APerf that this is what's happening: the cores are getting faster going from the 7th generation to the 8th, and we're getting a lot of performance for a fairly small bump in price, if that's a trade-off you're willing to make.

Thumbnail 2380

Thumbnail 2400

Back-end stalls actually got higher again, because we've moved the bottleneck from the front end to the back end, and branch misses went down by almost 50%, so everything got better. But suppose we say, no, I don't want to move to m8g. What can we do then? Maybe there's some constraint; perhaps m8g hasn't launched yet in the region you have to run in, or something like that. It's a real problem we've heard customers talk about. So at that point, what's one of the answers? As we said, everything has a cost.

Thumbnail 2410

Thumbnail 2440

Thumbnail 2450

Thumbnail 2460

Thumbnail 2470

Thumbnail 2490

Aspect-oriented programming, or leaning on object-oriented interfaces, abstract classes, and lots of derived classes, all have costs. Many times the cost is hidden in the extra code needed to make all of these nice software abstractions and constructs work, and that extra code runs. If we're really set on staying on m7g, then we have to actually change the code. In this case, we went to the extreme and removed all of the aspects except for the handful we absolutely need, like the ones that turn these methods into HTTP endpoints. We started inlining a lot of the aspects we showed earlier: tracking authorization stats is now inlined, logging is inlined, and we also inlined metrics and rate limiting. We've taken a lot away, and we've made the trade-off that the code is less maintainable. Maybe that's not a trade-off you want to make, but if we did, the question becomes whether it did anything. Did we get a return on our effort, or did we spend a lot of effort and potentially incur tech debt just to get a modest gain?
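As a hedged sketch of what that inlining looks like in practice (illustrative names, not the demo's actual code): instead of routing every request through a metrics aspect, the endpoint updates the shared counter map directly, so the hot path no longer passes through the aspect proxy machinery.

```groovy
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.LongAdder

@RestController
class HelloController {

    // Shared request counters that the metrics aspect used to maintain.
    static final ConcurrentHashMap<String, LongAdder> COUNTS = new ConcurrentHashMap<>()

    @GetMapping('/hello')
    String hello() {
        // Inlined metrics: one map update, no proxy or reflection frames on the stack.
        COUNTS.computeIfAbsent('hello') { new LongAdder() }.increment()
        'Hello!'
    }
}
```

The functionality is the same; what changed is how much extra code runs on every request, which is exactly the trade-off between maintainability and performance being described here.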

I hope I don't see a review or survey that says we told you not to use object-oriented programming. That's not the goal. The goal is to show that there is a cost to all this stuff and it's not free. Although in your organization, you may value maintainability and readability more than performance, and that is perfectly fine. That's your decision to make. The point is that you can use all the data to try to come to these trade-offs. You're seeing the CPU going slow. You have flame graphs that are 200, 300, or 400 stack frames deep. That's all telling you that the code might have gotten too complex or the things you're using are adding lots of extra overhead that wasn't immediately obvious just by looking at the code that you wrote.

Thumbnail 2500

Thumbnail 2580

Thumbnail 2590

Thumbnail 2600

Thumbnail 2610

If we wind things back and take away some of these aspects, we actually get a pretty healthy return for our application. We're up to 11,000 requests per second and still under our SLO of 100 milliseconds, at 28 milliseconds. That's on m7g. So from m7g to m7g, we can increase performance by almost 3X just by rearranging parts of our code. We still have the same functionality, but how it's implemented, exposed, compiled, and run has changed. We can take a look at the report (these are all scripts, so I can open these up fast) and go straight to the Java heat maps, putting them side by side. The thing to notice is that the AOP-optimized version is on the left; it's the same code with all of our aspects, and those aspects produce very, very deep flame graphs with lots and lots of calls just to make the code work. The clean version is on the right, and once we start backing the aspects out, I don't have to scroll down nearly as far to get to the end of the code.

Thumbnail 2660

Really what we did here is instead of redoing concurrent hash map or optimizing anything with assembly, we took away the overheads that we saw from the signals we got from APerf, which was that we're running a lot of code we didn't expect. We asked whether we could take away some of that code, run fewer instructions, and that leads to some pretty sizable performance gains. It's not always about coming up with a clever new algorithm. It may just be things like this that are hiding in plain sight. That's the Groovy demo. We do have the getting started guide for all of these optimizations we talked about.

Thumbnail 2670

There is a getting started guide, the Graviton Getting Started guide. Don't let the name fool you: most of what we suggest in there is applicable across architectures, so it's not just a Graviton thing. It covers a bunch of different languages too, not just Java or JVM-based languages; you'll see C++ and some other things in there if you're interested. You can run it on Intel, AMD, whatever you want. So, next demo.

The MongoDB Demo: Identifying Storage Bottlenecks with APerf

And we're coming up on 15 minutes, so we're probably going to buzz through this fairly quickly. We have MongoDB. The whole goal here is to show you that APerf surfaces signals not just from your own code, but from other things too. It's an ambient collector: it just sits on the box, collects everything, and serves it back to you with nice visuals. So what if one of the things we needed to run was MongoDB, and it was running poorly? Is there any way, without rewriting MongoDB, to figure out why?

Thumbnail 2750

So this is our next setup. The topology is very simple, the same three node groups. The first one is our load generator, where we're running YCSB to drive the load. The other two are running two different instance types. There's actually a typo on the slide: it says m7g and m8g, but that's not the case. We have an m7g and an m7gd. Does anybody know what the d on the end means? It's locally attached NVMe, so no surprise which one is probably going to run faster. So we've got EBS-backed storage on the m7g and local NVMe storage on the m7gd. The two node groups are m7g.xlarge and m7gd.xlarge, both running MongoDB. That little A indicates that we have an APerf pod running alongside it. After we start it up and load it, I'll let Jeff do his thing and we'll take a look.

Thumbnail 2800

Thumbnail 2820

Thumbnail 2830

Because this is an application we didn't write, we're not going to show any MongoDB code. We just deployed a pod with a spec that says: run MongoDB 8 on 3 cores, and we'd like to get, say, 6,000 requests per second out of it. We attach 128 gigabytes of storage using a persistent volume claim on gp2, and we think that should be good enough. If we run YCSB against it to load test and see if we're right, we come out with a score of only 4,000 requests per second. So this is a case where we need to figure out what we can do to make this faster, and we're not going to optimize code. The point is to show that you can optimize by optimizing EC2 itself: how do you do it with all of EC2, all the instance types available?
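For context, a YCSB run against MongoDB typically follows a load-then-run pattern like the sketch below; the workload file, record counts, thread count, and connection string are placeholders rather than the demo's actual parameters.

```bash
# Load the initial dataset into the MongoDB pod, then run the measured phase.
./bin/ycsb load mongodb -s -P workloads/workloada \
    -p mongodb.url=mongodb://mongo-svc:27017/ycsb -p recordcount=1000000

./bin/ycsb run mongodb -s -P workloads/workloada -threads 32 \
    -p mongodb.url=mongodb://mongo-svc:27017/ycsb -p operationcount=10000000
```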

Thumbnail 2860

Thumbnail 2880

So again we recorded with APerf. What will it show us? We start by looking at all the signals before we dive in and say maybe I shouldn't use MongoDB. That is an option, but can I make MongoDB go faster, because I really want to use it for some features it has? So we'll open up the EBS report again. By now we should be familiar with the APerf start page; we confirm we're running an m7g, and we don't have to go very far before we see a problem. Our total CPU utilization is pegged at or very close to 100%. But looking at the legend, user and system time are way down here; we're hardly running any user code at all. Because we're using EBS, we're spending a lot of the time just waiting for the disk. We're in I/O wait most of the time, 65 to 80% of it. I/O wait is just a signal that there are threads ready to run; they could do something if the disk came back fast enough.

Thumbnail 2940

So in this case, we're disk bound. I don't have to change the code; maybe if I just use something with faster storage, we can do better. But we have to ask what the costs are. This is another thing we think about with performance engineering: we look at it holistically, with all our data. What are the costs to access data? This graph shows the various things you can access on a system, every form of memory or storage from a CPU register all the way up to S3. Along the y-axis is latency in nanoseconds on a log scale, so every step up is another 10x. By the time we get out to disk, we're taking tens to hundreds of thousands of nanoseconds, and if you're using S3 as your storage, it's tens to hundreds of millions of nanoseconds just to gather some data for your compute. In CPU terms, that's an eternity.

Thumbnail 2990

If we put this into human time, this chart is the same graph in table form. The CPU times are real, but imagine scaling them so that you are the math engine and can do one math problem per second. If we're going out to EBS, which is its own distributed storage service, and you have to get a piece of data before you can complete your calculation, it could take anywhere from 20 of your days to half a year of sitting there waiting for someone to run off, get the paper, and give it to you.

That's where we say, well, maybe we should try something like the local SSD on the m7gd. It would only take a day, which is an order of magnitude faster.

Thumbnail 3040

Thumbnail 3050

So we'll go back and see if that's actually true, whether the data we collected with APerf really points to an advantage if we move to, say, the m7gd. We ran this in the background while we've been talking. If we run YCSB against the m7gd, we see a 3X performance increase; we're up to 12,000 requests per second. And if I open up the APerf report, you see exactly that: I/O wait goes down to almost zero, and CPU time is now the dominant factor. We're actually computing things on our document database instead of just waiting for the disk to get back to us.

Thumbnail 3080

Key Takeaways: Finding Performance Opportunities Beyond Code Optimization

And that ends the code demos, so back to the PowerPoint. The takeaway is that performance engineering is not just about hacking up great new algorithms or writing in low-level languages like assembly or C; it's about finding opportunities wherever they are. They may not even be in code; they may simply be in how you choose and configure the EC2 instances you use. We did that plenty during these demos, where we weren't hacking on code so much as trying the m7gd, or m8g instead of m7g.

Those opportunities could be anywhere. They're not necessarily in the algorithm. It could be how you set up the networking on your instances, the instance families you choose, the sizes you choose, the core-to-memory ratios. That's the big takeaway we want to share with you: use APerf or other tools to get the whole view before you go and start hacking away to make things faster.

Thumbnail 3150

Part of it is understanding what your system wants more of, right? How do you identify that? Once you've identified it, you can make the right decision, whether that's a new instance type, faster disks, or whatever. We've got 7 minutes left, or we can let you go early. If you've got questions, we're happy to answer them, but if not, please fill out the survey and tell us you loved it, hated it, or whatever, so we can get better next time. I do have stickers.


This article is entirely auto-generated using Amazon Bedrock.
