Answer
Yes, it is possible. However, since scaling is reactive—occurring only after requests are rejected—some clever workarounds might be necessary for production use. I conducted some tests but haven't found a perfect solution yet, so if anyone has any insights, please let me know.
Application Load Balancer Target Optimizer
Announced in November 2025, this feature controls traffic by running an "ALB Agent" on the target side, allowing for information exchange between the ALB and the Agent. A primary use case is for LLM applications where a target instance can only handle one or two requests simultaneously. By controlling concurrency, it aims to prevent excessive load on individual instances.
This got me thinking: the feature reduces the load on the target by having the ALB return errors once the set concurrency limit is reached. In such a scenario, there’s naturally a need to scale the targets. I wondered if there was a way to auto-scale precisely when that concurrency limit is hit.
For instance, in Google Cloud Run, you can explicitly control concurrency per instance, and it automatically scales when an instance can no longer handle the load. My understanding was that we can now achieve similar control for instances behind an ALB based on per-instance concurrency. So, I decided to experiment.
Step 1. Preparation
To trigger an intentional concurrency "overflow," I set up the environment by following this blog post:
https://dev.classmethod.jp/articles/try-aws-alb-target-optimizer/
First, I configured three instances with TARGET_CONTROL_MAX_CONCURRENCY set to 1 and confirmed that I could run a parallel load test against them, as in the sketch below.
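For reference, a load generator along these lines is enough to push past the limit. The URL and the request counts are placeholders for my environment, not values taken from the referenced post:

```python
# Minimal parallel load generator to exceed the per-instance concurrency limit.
# TARGET_URL and the counts below are placeholders, not values from the original setup.
import urllib.request
import urllib.error
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "http://my-alb-dns-name.example.com/"  # hypothetical ALB endpoint
CONCURRENCY = 10        # more in-flight requests than targets * TARGET_CONTROL_MAX_CONCURRENCY
TOTAL_REQUESTS = 100

def hit(_: int) -> int:
    """Send one request and return the HTTP status code (503 means rejected)."""
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(hit, range(TOTAL_REQUESTS)))

# Summarize how many requests succeeded (200) vs. were rejected (503).
print({code: statuses.count(code) for code in set(statuses)})
```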
Step 2. Finding Metrics for Scaling
Quoting from the AWS Blog:
You can troubleshoot using the following metrics in CloudWatch:
TargetControlRequestCount: Number of requests forwarded by ALB to the agent.
TargetControlRequestRejectCount: Number of requests rejected by ALB due to no targets being ready to receive requests. This metric shows an uptick when TargetControlWorkQueueLength is zero.
TargetControlActiveChannelCount: Number of active control channels between the ALB and agents. Ideally, this should be equal to the number of agents.
TargetControlNewChannelCount: Number of new channels created between the ALB and agents.
TargetControlChannelErrorCount: Number of control channels between ALB and agents that failed to establish or experienced an unexpected error.
TargetControlWorkQueueLength: Number of signals received by the ALB from agents asking for requests.
TargetControlProcessedBytes: Number of bytes processed by ALB for traffic going to target groups that enable target optimizer.
An increase in TargetControlRequestRejectCount suggests that the system can no longer process requests and needs to scale. However, it also means 503 errors are already being returned to users, so you would need client-side retries or other workarounds. TargetControlWorkQueueLength dropping to zero likewise signals that rejections are about to start, but in practice both metrics fire at essentially the same time.
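If you want to watch these values outside the console, a boto3 query along the following lines works. The AWS/ApplicationELB namespace and the LoadBalancer / TargetGroup dimension format are the standard ones for ALB metrics; the dimension values themselves are placeholders for your own resources:

```python
# Sketch: pull the TargetControl* metrics for the last hour with boto3.
# LoadBalancer / TargetGroup dimension values are placeholders; the dimension
# names follow the usual AWS/ApplicationELB convention and may need adjusting.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

for metric in ("TargetControlRequestRejectCount", "TargetControlWorkQueueLength"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName=metric,
        Dimensions=[
            {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"},        # placeholder
            {"Name": "TargetGroup", "Value": "targetgroup/my-tg/0123456789abcdef"},  # placeholder
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    print(metric, [(p["Timestamp"].isoformat(), p["Sum"]) for p in points])
```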
Step 3. Testing Actual Scaling
Next, I set up a CloudWatch Alarm on TargetControlRequestRejectCount to trigger auto-scaling. First, I created the Alarm.
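Roughly the same alarm can be created with boto3. The alarm name, threshold, and dimension values below are placeholders rather than my exact console settings:

```python
# Sketch: alarm that fires when any rejects are recorded in a 1-minute period.
# Alarm name, threshold, and dimension values are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="alb-target-optimizer-rejects",  # hypothetical name
    Namespace="AWS/ApplicationELB",
    MetricName="TargetControlRequestRejectCount",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"},        # placeholder
        {"Name": "TargetGroup", "Value": "targetgroup/my-tg/0123456789abcdef"},  # placeholder
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",  # fire as soon as Sum > 0
    TreatMissingData="notBreaching",
)
```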
Then, I linked it to the Auto Scaling Group. With these settings in place, I applied load using the same parallel load test as before and confirmed that the system does indeed scale out.
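In code terms, the "link" is just a scaling policy on the Auto Scaling Group whose ARN becomes the alarm's action. A minimal sketch, assuming an existing group (the group name and adjustment values are placeholders):

```python
# Sketch: simple scaling policy that adds one instance, wired to the alarm above.
# The Auto Scaling group name and adjustment values are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="alb-target-optimizer-asg",  # hypothetical ASG name
    PolicyName="scale-out-on-rejects",
    PolicyType="SimpleScaling",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,   # add one instance each time the alarm fires
    Cooldown=300,          # avoid scaling again while the new instance warms up
)

# Re-put the alarm with the policy ARN as its action; put_metric_alarm
# overwrites an existing alarm of the same name.
cloudwatch.put_metric_alarm(
    AlarmName="alb-target-optimizer-rejects",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetControlRequestRejectCount",
    Dimensions=[
        {"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"},
        {"Name": "TargetGroup", "Value": "targetgroup/my-tg/0123456789abcdef"},
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[policy["PolicyARN"]],
)
```

A step scaling or target tracking policy would also work here; SimpleScaling is just the shortest thing to sketch.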
Summary
Using ALB Target Optimizer, you can scale compute resources when pre-set concurrency limits are exceeded. For use cases like LLM applications, where a single request is heavy and uneven load distribution across instances is problematic, this mechanism works well by letting the Agent signal the ALB whether it can handle more traffic.
In this experiment, I used TargetControlRequestRejectCount to trigger a scale-out, but this results in "reactive" scaling—scaling only after the user has already been impacted. To ensure a smooth user experience, further refinement is needed. If anyone has any ideas, I'd love to hear them.
The fact that a dedicated Agent communicates with the ALB to allow for more granular traffic control is a mechanism previously unseen in ALB. I'm excited to see how this functionality evolves in the future.