mgd43b for AgentEnsemble

Posted on May 17 • Originally published at agentensemble.net

Capacity Management in Agent Networks: Rate Limiting, Priority Queues, and Backpressure

#java #ai #agents #architecture

Agent ensembles that run as long-lived services on a network will, at some point, receive more work than they can handle. The question is what happens next.

Without capacity management, the answer is usually one of: unbounded queue growth (OOM), random request dropping, or cascade failures where an overloaded ensemble backs up its callers. None of these are acceptable for systems that run in production.

The capacity problem in agent networks

Agent workloads have properties that make capacity management harder than in traditional request/response systems:

Variable execution time. A simple analysis task might take 5 seconds. A complex coding task might take 5 minutes. You cannot predict queue drain rate from request count alone.
Variable cost. Each agent iteration consumes LLM tokens. An overloaded system does not just slow down -- it burns money faster.
Non-deterministic behavior. An agent might complete in 3 iterations or 30. Capacity planning based on averages can be wildly wrong for individual requests.
Fan-out amplification. One incoming request to a coordinator might fan out to 5 different ensembles. Overload at one point amplifies across the network.

These properties mean that simple concurrency limits are necessary but not sufficient. You also need priority ordering, starvation prevention, and the ability to adjust capacity proactively.

Rate limiting

The first line of defense is limiting how many tasks an ensemble processes concurrently:

Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .maxConcurrent(10)
    .build();

When the concurrency limit is reached, new requests queue. The queue has a configurable maximum depth. When the queue is also full, the ensemble rejects new requests with a backpressure signal.

The backpressure signal propagates upstream: the calling ensemble receives a rejection and can retry, route to an alternative provider (via discovery), or report the failure to its own caller.

This is straightforward but effective. The ensemble protects itself from overload, and the backpressure signal gives callers information they need to make routing decisions.

Priority queues with aging

Not all requests are equally urgent. A VIP guest's meal request should be processed before a routine inventory check. Priority queues handle this, but naive priority queues have a starvation problem: low-priority requests may never be processed if high-priority requests keep arriving.

The PriorityRequestQueue adds aging to prevent starvation:

PriorityRequestQueue queue = PriorityRequestQueue.builder()
    .requestQueue(baseQueue)
    .levels(3)
    .agingInterval(Duration.ofMinutes(5))
    .build();

// Enqueue with priority
queue.enqueue(vipRequest, 0);       // highest priority
queue.enqueue(normalRequest, 1);    // normal
queue.enqueue(batchRequest, 2);     // lowest priority

Aging works by promoting requests that have waited longer than the aging interval. A batch request (priority 2) that has been waiting for 10 minutes (two aging intervals) gets promoted twice to priority 0. It will be processed next, regardless of incoming high-priority requests.

This guarantees that every request is eventually processed, while still giving meaningful priority to urgent work in the common case.

Operational profiles

Rate limits and priorities handle reactive capacity management -- responding to load as it arrives. Operational profiles handle proactive capacity management -- adjusting capacity in anticipation of known load changes.

A NetworkProfile bundles per-ensemble capacity targets and shared memory pre-load directives into a deployable unit:

NetworkProfile weekendProfile = NetworkProfile.builder()
    .name("sporting-event-weekend")
    .ensemble("front-desk", Capacity.replicas(4).maxConcurrent(50))
    .ensemble("kitchen", Capacity.replicas(3).maxConcurrent(100))
    .preload("kitchen", "inventory", "Extra beer and ice stocked")
    .build();

ProfileApplier applier = new ProfileApplier(sharedMemoryRegistry, broadcaster);
applier.apply(weekendProfile);

When a profile is applied, two things happen:

Pre-load directives seed shared memory scopes with context (e.g., alerting the kitchen about extra stock)
A ProfileAppliedMessage is broadcast to all ensembles with the new capacity targets

The broadcast message includes replica counts and concurrency limits. Consumers of this message (a Kubernetes operator, a scaling script, or the ensembles themselves) can use the targets to trigger actual scaling.

Scheduled profiles

Profiles can be applied on a schedule:

ProfileScheduler scheduler = new ProfileScheduler(applier);

// Apply weekend profile every 7 days
scheduler.schedule(weekendProfile,
    Duration.ofHours(2),    // initial delay
    Duration.ofDays(7));    // interval

// One-shot: return to normal after the weekend
scheduler.scheduleOnce(normalProfile, Duration.ofDays(3));

Directive-driven profiles

Profiles can also be applied via the directive system, enabling external triggers:

NetworkProfileDirectiveHandler handler =
    new NetworkProfileDirectiveHandler(applier, profiles);

directiveDispatcher.registerHandler("APPLY_PROFILE", handler);

An external system (monitoring, a human operator, a scheduler) sends an APPLY_PROFILE directive with the profile name, and the network adjusts.

Scheduled tasks

Long-running ensembles often need to perform recurring work: inventory checks, health monitoring, report generation. The ScheduledTask API makes this a first-class concern:

Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .scheduledTask(ScheduledTask.builder()
        .name("inventory-check")
        .task(Task.of("Check current inventory levels and generate report"))
        .schedule(Schedule.every(Duration.ofHours(1)))
        .broadcastTo("hotel.inventory")
        .build())
    .scheduledTask(ScheduledTask.builder()
        .name("equipment-check")
        .task(Task.of("Verify all kitchen equipment is operational"))
        .schedule(Schedule.every(Duration.ofHours(12)))
        .build())
    .build();

Scheduled tasks run alongside the ensemble's normal work processing. They use the same concurrency limits -- a scheduled task counts against maxConcurrent the same as an incoming request. This prevents scheduled tasks from starving request processing.

The optional broadcastTo sends the task result to a named topic, where other ensembles can consume it as context.

Audit trail

For operational visibility, the audit trail captures significant events across the network:

Work request received/completed/failed
Capacity changes (profile applied, scaling events)
Discovery events (capability registered/deregistered)
Federation events (cross-realm routing)

The audit trail is append-only and can be backed by the same transport infrastructure as the rest of the network (in-memory for development, Kafka for production).

Tradeoffs

Rate limits are blunt. A concurrency limit of 10 treats all tasks equally. A quick 5-second task and a 5-minute coding task both count as one. For workloads with highly variable task duration, adaptive concurrency (adjusting limits based on observed completion times) would be more effective, but adds complexity.

Profile-based scaling is manual. Operational profiles define target capacities, but the actual scaling (adjusting K8s replicas, for instance) must be performed by an external system. The profile broadcast is a signal, not an actuator.

Priority is caller-declared. The caller assigns priority when enqueuing a request. There is no built-in mechanism to enforce priority policies or prevent callers from marking everything as high priority. Priority policies need to be enforced by convention or middleware.

Aging is time-based. Requests age based on wall-clock time, not queue depth or system load. Under sustained high load, aging may promote low-priority requests sooner than desired. Under low load, aging is irrelevant because everything is processed quickly anyway.

The design principle

Capacity management for agent networks needs three layers:

Reactive protection -- rate limits and backpressure to prevent overload in real time
Priority ordering -- ensuring urgent work is processed first, with aging to prevent starvation
Proactive adjustment -- operational profiles to scale capacity before anticipated load changes

Each layer addresses a different time horizon: seconds (rate limits), minutes (priority queues), and hours/days (operational profiles). Together, they give operators the tools to keep an agent network running under variable load without manual intervention for routine capacity changes.

Capacity management is part of AgentEnsemble. The rate limiting guide, operational profiles guide, and scheduled tasks guide cover the full APIs.

I'd be interested in how others handle capacity management in agent networks -- especially whether adaptive concurrency limits (based on observed task duration) have been useful in practice.

DEV Community