Kazuya

Posted on Dec 5, 2025 • Edited on Dec 8, 2025

AWS re:Invent 2025 - Load balancing evolved: ALB Target Optimizer (NET336)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.


Overview


In this video, Ashish Kumar, a product manager in AWS Networking, introduces Target Optimizer, a new feature for Application Load Balancer designed for low-concurrency applications like large language models. He explains how traditional load balancing algorithms like round robin and least outstanding requests fall short for compute-intensive applications where each target processes only a few concurrent requests. Target Optimizer solves this by using an agent installed on targets that communicates with load balancer nodes through control channels, signaling when capacity is available. This approach achieves high request success rates, improved target group utilization, and automatic load shedding. The setup involves installing the agent, configuring max concurrent requests, creating an optimized target group with a control port, and adding it to an Application Load Balancer. The feature supports EC2 instances, IPs, Amazon EKS, and Amazon ECS services.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

The Challenge of Load Balancing Low-Concurrency Applications

Good afternoon, everyone. Welcome to NET336. My name is Ashish Kumar. I'm a product manager in AWS Networking. Today we'll be talking about a feature in Elastic Load Balancing that we launched about a week ago. This feature is called Target Optimizer.

The agenda for the next 20 minutes is as follows. We'll spend some time discussing the problem, then the solution. We'll introduce Target Optimizer, describe how it works, talk about how you can set it up, and finally conclude with some additional information.

Elastic Load Balancing is AWS's suite of networking load balancing products. It comprises three main products: the Network Load Balancer, which performs layer 4 load balancing; the Application Load Balancer, which performs layer 7 load balancing; and the Gateway Load Balancer, which performs inline traffic inspection and firewalling. Target Optimizer is a capability of the Application Load Balancer.

For those who aren't familiar with the Application Load Balancer, it is essentially an HTTP load balancer that distributes HTTP requests among application back ends, which we also call targets. A regular application setup looks somewhat like this: on the right-hand side you have the target group, which contains targets. These targets can be EC2 instances inside a VPC running your application, or they can be compute that is on premises. They can also be containers. In front of the target group you have the load balancer, which receives requests from a client.

Typically, the way things work is that the load balancer is configured with an algorithm, for example round robin, and it uses this algorithm to decide which target to send that request to. With round robin, the first request goes to the first target, the second request goes to the second target, and this process continues, with each target processing thousands of requests concurrently.
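As a minimal sketch of the round robin behavior described above (the target names are placeholders, and this is an illustration rather than how the load balancer is actually implemented), the rotation ignores how busy each target is:

```python
from itertools import cycle

def round_robin(targets):
    """Return an iterator that yields targets in fixed rotation."""
    return cycle(targets)

picker = round_robin(["A", "B", "C"])
# Five requests arrive; each simply takes the next target in the rotation.
assignments = [next(picker) for _ in range(5)]
print(assignments)  # ['A', 'B', 'C', 'A', 'B']
```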

Now, if we turn our attention toward a specialized class of applications, which I have called low-concurrency applications, the picture begins to change quite a bit. When I say low-concurrency applications, I mean applications that are compute intensive. Think of a large language model, where every request requires a whole bunch of compute resources to be fulfilled. As a result, each deployed instance of the target or the application or the model can only process a small number of requests at a time.

For these low-concurrency applications, the target group looks very different. Instead of having a small target group where each target is doing thousands of concurrent requests, you have a very large target group with a very large fleet size where each target can only process a few requests at a time. These could be as few as 100 or 10 or 5 or 2 or sometimes even 1. In my example, I've shown them to be able to process only one request at a time. When we are operating in this low-concurrency world, that is where traditional load balancing with its preconfigured algorithms starts to fall short of delivering the outcomes we would want from a load balancer.

Let's take the example of an application deployed on the target group that is doing some sort of media generation. Some applications could be doing video generation, some text generation, some image generation. Let's assume that each target can only process one request at a time and that the application is configured with round robin. Now, if at T equals zero all targets are busy processing requests, it's reasonable to assume that at a later point in time, some of them will have finished processing the requests, likely the ones that were doing text generation, while the ones that were doing more complicated tasks like video generation will still be busy.

At this point, if the load balancer receives requests, we would want those requests to go to the idle targets. But because the load balancer is configured with round robin, these requests will continue to use that algorithm and might end up on targets that are already busy.

As a result, the overall success rate of your application goes down. In other words, the error rate goes up. The requests that are denied have to be retried by the clients, and the perceived latency of the application increases. Meanwhile, we have targets sitting idle, waiting for requests they are not receiving. So the overall utilization and efficiency of your target group goes down.
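The failure mode can be reproduced with a toy simulation. This sketch assumes four hypothetical targets that each handle one request at a time, and that a request sent to a busy target is rejected outright; none of this reflects real load balancer internals.

```python
def route_round_robin(targets, busy, num_requests, start=0):
    """Route requests in rotation; a request sent to a busy target is rejected."""
    busy = set(busy)
    accepted, rejected = [], []
    for i in range(num_requests):
        target = targets[(start + i) % len(targets)]
        if target in busy:
            rejected.append(target)
        else:
            busy.add(target)  # the target now holds its single request
            accepted.append(target)
    return accepted, rejected

targets = ["A", "B", "C", "D"]
# A and B are still generating video; C and D finished text jobs and are idle.
accepted, rejected = route_round_robin(targets, busy={"A", "B"}, num_requests=2)
print(accepted, rejected)  # [] ['A', 'B']
```

Both requests are rejected even though C and D are idle, which is the drop in success rate and utilization described above.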

Now some of you might say that this is because the example we've taken is that of round robin, and if we were to use something more advanced like least outstanding requests as the algorithm in the load balancer, then things would be much better. You would be right. Least outstanding requests is an algorithm where the load balancer selects the target with the fewest outstanding requests, so things would definitely be better than with round robin. But they would still fall short of the outcome that we would want, and that has a lot to do with the way the load balancer is designed.

Even though we refer to the load balancer as a unit, it actually comprises multiple independent units called nodes which are decoupled by design and which make routing decisions independently. The view of the target group as seen by one node is different from what the other node sees. If requests on the first node get routed by least outstanding requests and we have, let's say, five targets processing requests, this will not be the view that the second node sees. When requests arrive on it, they will be routed independently, and these requests might land on the same targets that already had requests from the first node.
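A rough sketch of why independent nodes collide, assuming each node counts only the requests it routed itself and breaks ties by list order (both are simplifications made for this illustration):

```python
class Node:
    """A load balancer node that only sees the requests it routed itself."""

    def __init__(self, targets):
        self.outstanding = {t: 0 for t in targets}

    def route(self):
        # Least outstanding requests, computed from this node's local view only.
        target = min(self.outstanding, key=self.outstanding.get)
        self.outstanding[target] += 1
        return target

targets = ["A", "B", "C", "D"]
node1, node2 = Node(targets), Node(targets)
from_node1 = [node1.route() for _ in range(2)]
from_node2 = [node2.route() for _ in range(2)]
print(from_node1, from_node2)  # ['A', 'B'] ['A', 'B']
```

Each node picks sensibly from its own view, yet together they send two requests each to A and B while C and D receive none.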

As a result, we end up with the same three outcomes for low-concurrency applications. The success rate of the requests goes down and the error rate increases. The requests that are rejected have to be retried, so the client's perceived latency of the application increases. And some targets sit idle without receiving requests, so the overall utilization of the target group goes down.

How ALB Target Optimizer Solves the Problem with Agent-Based Routing

Now that we know the problem, let's talk about what the solution should be. The solution we would want is that the load balancer should know which targets are idle. So essentially, if targets D, E, H, and J are idle, we would want the load balancer to know those targets so that it can send the next few requests to those targets. Because the load balancer is decoupled, we would want the list of idle targets to be divided amongst the nodes in such a way that two nodes don't end up sending requests to the same target.

However, we would want our solution to be slightly more sophisticated than that. This is because the example we've been taking is one where each target can only process one request at a time. In most cases, each target will be able to process not one, but five, ten, or even a hundred requests at a time. Take this picture, in which I'm assuming each target can process two requests: targets B and C can each process one more request, and D and F can each process two more. Here we would want the load balancer to know not only which target to send a request to, but also how many requests to send to that target.
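To make the desired outcome concrete, here is a small sketch that hands each spare request slot to exactly one node, so no two nodes can claim the same slot. The dealing order and the node names are assumptions made for illustration; the talk does not describe how capacity is actually distributed.

```python
from itertools import cycle

def split_capacity(spare, nodes):
    """Deal each spare request slot to exactly one node, in rotation."""
    shares = {node: {} for node in nodes}
    node_cycle = cycle(nodes)
    for target, slots in spare.items():
        for _ in range(slots):
            node = next(node_cycle)
            shares[node][target] = shares[node].get(target, 0) + 1
    return shares

# B and C can take one more request each; D and F can take two more each.
spare = {"B": 1, "C": 1, "D": 2, "F": 2}
shares = split_capacity(spare, ["node1", "node2"])
print(shares)
```

Summing a target's slots across both nodes always gives back its spare capacity, so the nodes together can never overload it.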

Again, because it is decoupled, we would want those targets to be divided amongst the nodes in such a way that both nodes together don't end up sending more requests to a target than the target has the capacity to process.

Now that we understand what the solution is, we can talk about how ALB Target Optimizer implements the solution. The way ALB Target Optimizer works is that on the target where your application is running, you install an agent that is provided by AWS. What this agent does is that it serves as a proxy between the load balancer nodes and the application. It also establishes control channels with the load balancer nodes.

On these control channels, it exchanges management traffic. This management traffic is necessary for Target Optimizer to work. The moment the agent determines that the application is capable of processing one more request, it sends a signal to one of the nodes. The node registers that signal, and when a request arrives on the node, it knows that it can send that request to the target.

The way the agent knows when to send the signal is by tracking the number of requests the application is processing and comparing that with the max concurrent request configuration that you have specified on the agent. You must explicitly specify the number of concurrent requests that you want the application to process at maximum. For example, if you set the configuration to 3, you have configured the target to process 3 concurrent requests at maximum. When the application is indeed working on 3 requests, the agent will wait until one of those requests completes. When the application is down to 2 requests, it will send a signal to one of the load balancer nodes.

Let's say another request completes and the application is down to one request. The agent will send another signal to another load balancer node. These nodes will now know that if a request arrives, they can send that request to the target because the target is capable of processing it.

In this picture I have only shown one particular target, but things will actually look like this for the entire target group. Every target will have the agent installed and configured on it. Each agent will be communicating with each load balancer node through these control channels. On each agent, you can configure the max number of concurrent requests that you want that target to receive from the load balancer. That number can be a function of the underlying capacity of the target, such as its memory and CPU.
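The signaling behavior can be sketched as a toy model. The class and method names below are hypothetical, and picking nodes in rotation is an assumption; the talk only says that each signal goes to one of the nodes. How the agent advertises its initial capacity is not covered here.

```python
class Agent:
    """Toy model of the agent: one signal per request slot that frees up."""

    def __init__(self, max_concurrent, nodes):
        self.max_concurrent = max_concurrent
        self.nodes = nodes
        self.in_flight = 0
        self.signals_sent = []  # node names, in the order signals were sent
        self._next = 0

    def request_started(self):
        if self.in_flight >= self.max_concurrent:
            raise RuntimeError("target is already at max concurrent requests")
        self.in_flight += 1

    def request_completed(self):
        self.in_flight -= 1
        # A slot just opened: tell exactly one node it may send one request.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        self.signals_sent.append(node)
        return node

agent = Agent(max_concurrent=3, nodes=["node1", "node2"])
for _ in range(3):
    agent.request_started()   # the application is now working on 3 requests
agent.request_completed()     # down to 2: first signal goes out
agent.request_completed()     # down to 1: second signal goes to another node
print(agent.signals_sent)     # ['node1', 'node2']
```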

We can expect two outcomes from Target Optimizer. One is that we can expect a very high request success rate. This is because we know for certain that when a request arrives at the load balancer, it will go to a target which will have the capacity to process that request. It will not go to a busy target. So the overall request success rate goes up and your error rate goes down.

The second outcome is that we expect a very high utilization of the target group. This is because the moment capacity opens up on any one target, in other words, any one target finishes processing a request, the load balancer will send another request to the target. So the duration in which the target is idle is very small. The overall utilization of the target group increases, in other words, the efficiency increases. Because your target group is now more efficient, you can make do with a much smaller target group size than you had previously.

Another benefit that you get from Target Optimizer is that in the event where all targets are busy processing the max number of requests they are configured to process, any request that lands on the load balancer will be rejected by the load balancer itself. So you have this automatic load shedding that happens. In other words, your targets are protected from additional load, from being hotspotted, and from DoS attacks.
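Load shedding follows naturally from the signal model: a node with no pending signals has nowhere safe to send a request. The sketch below is a simplified illustration with hypothetical names; it does not model the actual response the load balancer returns for a rejected request.

```python
from collections import deque

class OptimizedNode:
    """Toy node that forwards a request only against a signal from a target."""

    def __init__(self):
        self.ready = deque()  # targets that have signaled a free slot

    def on_signal(self, target):
        self.ready.append(target)

    def route(self, request_id):
        if not self.ready:
            # No target has advertised capacity: shed the request at the node.
            return (request_id, None, "rejected")
        return (request_id, self.ready.popleft(), "forwarded")

node = OptimizedNode()
node.on_signal("D")        # target D reports one free slot
first = node.route(1)      # consumes D's signal
second = node.route(2)     # no signals left, so the target is never touched
print(first, second)       # (1, 'D', 'forwarded') (2, None, 'rejected')
```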

To recap the results, with Target Optimizer, we can expect our request success rate to go up because fewer requests are rejected and fewer of them have to be retried. So the perceived latency of the application decreases. Your target group as a whole has its utilization or efficiency increased. Load shedding happens automatically if all targets are at capacity.

Setting Up Target Optimizer: Configuration Steps and Monitoring

Now let's talk about how you can set up Target Optimizer. Setting up is a three-step process. In step one, you install the agent on the target. Remember, the agent is a proxy between the load balancer and the application. So you need to configure the port on which it receives traffic from the load balancer. This is your HTTP or HTTPS port, for example port 443. You also need to configure the port to which it proxies that traffic, which is the destination port. This should be the port on which your application is listening. You also configure the control port on which the agent establishes these control channels with the load balancer. Once you have configured these three ports, you specify the max number of concurrent requests that you want that target to process.
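The four required settings can be pictured as a small configuration record. The field names and port values below are hypothetical and are not the agent's real configuration keys (the user guide has those). The distinct-ports check assumes the agent and the application share a host.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical view of the agent's required settings."""
    data_port: int                # where the agent receives load balancer traffic
    destination_port: int         # where the application listens
    control_port: int             # where control channels are established
    max_concurrent_requests: int  # ceiling on requests in flight on this target

    def validate(self):
        ports = (self.data_port, self.destination_port, self.control_port)
        if any(not 1 <= p <= 65535 for p in ports):
            raise ValueError("ports must be in the range 1..65535")
        if len(set(ports)) != 3:
            raise ValueError("the three ports must be distinct")
        if self.max_concurrent_requests < 1:
            raise ValueError("max_concurrent_requests must be at least 1")
        return self

# Example values only; choose ports that match your own deployment.
config = AgentConfig(data_port=443, destination_port=8080,
                     control_port=3000, max_concurrent_requests=3).validate()
print(config)
```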

There are four other optional variables that I will not go into, but these are related to TLS. The user guide talks more about them.

Once you have installed the agent on each target, you register these targets with a new target group. I have called it the optimized target group. When you create this target group, you also need to specify the control port, which is the same as you configured on the agent.

Once you have created this new target group, all that is left is to add it to an Application Load Balancer. If you already have an existing load balancer that is already configured with a listener, which is already sending traffic to an existing target group, you simply need to modify that listener rule and add this new target group to that listener. Once you have added it, you can simply move traffic over from the old target group to the new target group.
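One way to move traffic gradually, assuming the listener rule forwards to both target groups with relative weights, is to step the weights from the old group to the new one. The talk does not prescribe a migration procedure; this helper is a hypothetical sketch that only computes the weight pairs and does not call any AWS API.

```python
def shift_schedule(steps):
    """Return (old_weight, new_weight) pairs that move traffic in equal steps."""
    schedule = []
    for i in range(steps + 1):
        new_weight = round(100 * i / steps)
        schedule.append((100 - new_weight, new_weight))
    return schedule

schedule = shift_schedule(4)
print(schedule)  # [(100, 0), (75, 25), (50, 50), (25, 75), (0, 100)]
```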

To recap the setup, you install the agent on your targets, you register these targets with a new target group, and then you add this new target group to a listener on a load balancer.

The targets that are supported for Target Optimizer are EC2 instances and IPs. It also works with Amazon EKS and Amazon ECS services. For EKS and ECS, you simply run the agent as a sidecar container with your application.

We also have new metrics to help troubleshoot and monitor Target Optimizer. You can monitor the active control channels between the load balancer and the agents, any channels that are running into errors, how many requests went to a target-optimized target group, how many of them were rejected, and also the control queue lengths. The control queue length is essentially the number of signals that the load balancer has received from the agents that are running on the targets.
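As a sketch of how these readings might be combined, the helper below derives a rejection rate and a count of missing control channels. The parameter names and sample numbers are made up for illustration and are not the real metric names. The expected channel count assumes one channel per agent per load balancer node, since each agent communicates with each node.

```python
def summarize(requests, rejected, active_channels, expected_channels):
    """Summarize hypothetical metric readings for one optimized target group."""
    rejection_rate = rejected / requests if requests else 0.0
    return {
        "rejection_rate": rejection_rate,
        "missing_channels": max(expected_channels - active_channels, 0),
    }

# Example: 10 targets and 2 nodes give 20 expected control channels.
summary = summarize(requests=2000, rejected=50,
                    active_channels=18, expected_channels=20)
print(summary)  # {'rejection_rate': 0.025, 'missing_channels': 2}
```

A rising rejection rate alongside a full set of channels would suggest the fleet is at capacity, while missing channels would point at agents that are down or unreachable.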

Here are some resources about Target Optimizer. Please do read the launch blog. It has a step-by-step process on how you can set it up. That concludes this presentation. Thank you for attending.

This article is entirely auto-generated using Amazon Bedrock.