How to bridge legacy HTTP clients to long-running AI tasks without 504 timeouts or stateful bottlenecks
The business wants you to integrate a new LLM feature. You wire up a
standard REST endpoint, deploy it, and it works flawlessly in testing. Then it hits production. The AI takes 45 seconds to generate a response during peak load. Your API Gateway drops the connection at 30 seconds. The client gets a 504 Gateway Timeout, the user furiously clicks retry, and suddenly you have a thundering herd that takes down your entire connection pool.
Welcome to the era of AI workloads on legacy HTTP infrastructure.
Standard REST APIs are built for speed. AI workloads are fundamentally slow. If you do not decouple them, your architecture will eventually shatter under the weight of holding thousands of long-running HTTP threads open.
The "Anti-Pattern" Lifeline: Sync-over-Async
In a perfect world, your clients would be fully event-driven, communicating over WebSockets or Server-Sent Events. In the real world, you have legacy mobile apps, older frontends, and strict partner webhooks that only speak one language: they send an HTTP POST and they expect a 200 OK with a JSON payload immediately. You cannot force them to implement an Azure Service Bus listener.
This is where the Sync-over-Async Gateway comes in.
It is an edge integration pattern where a Gateway receives a synchronous HTTP request, converts it into an asynchronous message on a broker (like Azure Service Bus), waits for the backend worker to process it, and then maps the reply back to the original HTTP connection.
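Stripped of the broker details, the heart of the pattern is a map of pending replies keyed by a correlation id: the HTTP thread parks on a future, and the reply handler completes it. Here is a minimal, self-contained sketch; the broker is simulated with an executor, and all names are illustrative rather than part of any real API:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncOverAsyncGateway {
    // Pending HTTP requests, keyed by the correlation id we stamped on the outbound message.
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    // Stand-in for the broker + worker; in production this is a Service Bus round trip.
    private final ExecutorService fakeBroker = Executors.newSingleThreadExecutor();

    // Called by the synchronous HTTP handler.
    public CompletableFuture<String> sendAndReceive(String payload) {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> reply = new CompletableFuture<>();
        pending.put(correlationId, reply);
        // Publish the request; here a worker reply is simulated immediately.
        fakeBroker.submit(() -> onReply(correlationId, "processed:" + payload));
        return reply;
    }

    // Called when a reply message with a matching correlation id arrives.
    public void onReply(String correlationId, String body) {
        CompletableFuture<String> reply = pending.remove(correlationId);
        if (reply != null) {
            reply.complete(body);
        }
    }

    public void shutdown() {
        fakeBroker.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SyncOverAsyncGateway gw = new SyncOverAsyncGateway();
        String result = gw.sendAndReceive("hello").get(5, TimeUnit.SECONDS);
        System.out.println(result); // processed:hello
        gw.shutdown();
    }
}
```

The hard part, as the rest of this post shows, is making `onReply` fire on the *right instance* when fifty copies of this gateway sit behind a load balancer.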
The Trap: Azure Service Bus Sessions
When engineers build this on Azure, the immediate instinct is to use Service Bus Sessions.
- The Gateway sends a message with SessionId = 123.
- The Gateway blocks and listens to a reply queue exclusively for SessionId = 123.
- The Worker processes the task and sends the reply with SessionId = 123.
This works beautifully on a single machine. At scale, it is a disaster.
If you have 50 Gateway instances behind a load balancer, how does the reply get back to the exact instance holding the open HTTP connection? If you use Sessions, your system becomes deeply stateful. Instance #1 has to explicitly request the lock for Session 123. If Instance #1 crashes, that session is locked until it times out. Furthermore, Azure Service Bus Standard tier enforces hard limits on concurrent sessions, meaning a traffic spike will instantly exhaust your namespace.
Sessions force you to manage stateful routing across a distributed cluster. It breaks horizontal elasticity.
The Fix: Stateless Filtered Topics
To achieve true horizontal scale, the Gateway layer must be 100% stateless. Instead of using locked sessions, we can push the routing logic down to the broker using a Filtered Topic Pattern.
- Explicit Addressing: The Gateway injects a unique ReplyToInstance property into the request (e.g., Instance-A).
- Dynamic Subscriptions: On startup, each Gateway creates a lightweight, temporary subscription on a global reply topic with a SQL rule: ReplyToInstance = 'Instance-A'.
- Broker-Side Routing: When the backend worker finishes, it attaches the same property to the reply. The Azure broker evaluates the SQL filter and pushes the message only to the specific Gateway pod waiting for it.
No session locks. No implicit instance affinity. Complete horizontal scalability.
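The per-instance contract is small enough to sketch directly. The helper below derives the instance id, the temporary subscription name, and the SQL rule text; the naming scheme and the "ReplyToInstance" property key follow the description above but are otherwise assumptions, not a fixed Sentinel API:

```java
import java.util.UUID;

// Sketch of the per-instance routing contract each gateway establishes on startup.
public class ReplyRouting {
    final String instanceId;

    ReplyRouting() {
        // A fresh id per process: no coordination with other replicas required.
        this.instanceId = "gw-" + UUID.randomUUID();
    }

    // Name of this replica's temporary subscription on the shared reply topic.
    String subscriptionName() {
        return "reply-" + instanceId;
    }

    // The SQL filter the broker evaluates against each reply's application properties.
    String sqlRule() {
        return "ReplyToInstance = '" + instanceId + "'";
    }

    public static void main(String[] args) {
        ReplyRouting routing = new ReplyRouting();
        // In a real gateway these values would be handed to the Azure
        // ServiceBusAdministrationClient to create the subscription, with the
        // rule text wrapped in a SqlRuleFilter.
        System.out.println(routing.subscriptionName());
        System.out.println(routing.sqlRule());
    }
}
```

Because the id is minted fresh on every startup, a restarted pod simply creates a new subscription; setting auto-delete-on-idle on the subscription lets the broker garbage-collect the old one.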
Breaking Down the Stateless Architecture
If you look at the architecture diagram above, here is exactly how we eliminate the Session bottleneck and achieve infinite horizontal scale:
1. The Synchronous Edge (Left Side)
The client sends a standard, blocking HTTP REST request. Our Load Balancer distributes this to any available Gateway Replica (e.g., Replica 1). Because our Gateway is completely stateless, the load balancer doesn't need to worry about sticky sessions.
2. The Asynchronous Handoff (Middle)
Replica 1 takes the HTTP payload and publishes it to the Azure Service Bus Request Topic.
Crucially, it does NOT open a Service Bus Session. Instead, it stamps the message with two properties: a unique CorrelationId (e.g., replica1_reqA) that identifies this request, and a ReplyToInstance property (e.g., replica1) that identifies the gateway itself. Replica 1's lightweight, dynamic subscription on the Reply Topic, created once at startup with the strict SQL filter ReplyToInstance = 'replica1', will catch every reply addressed to it.
3. The AI Worker Layer (Right Side)
Your long-running AI workers operate as standard, competing consumers. A worker pulls the request from the topic, processes the heavy LLM prompt for 45 seconds, and generates the result. To send the result back, the worker simply attaches that exact same CorrelationId to the response message and drops it onto the global Reply Topic.
4. Broker-Side Routing (The Magic)
This is where the architecture shines. The Gateway instances are not actively polling or fighting over locked sessions. The Azure Service Bus broker evaluates the incoming reply message, reads ReplyToInstance = 'replica1', matches it against Replica 1's dynamic SQL filter, and pushes the message directly down that specific pipe. Replica 1 then uses the CorrelationId (replica1_reqA) to hand the payload to the exact HTTP request still waiting on it.
Replica 1 receives the answer, maps it back to the open HTTP thread, and returns the 200 OK to the client. If Replica 1 had crashed during those 45 seconds, its temporary subscription would simply vanish—no locked sessions, no frozen resources, and no blocked queues.
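One practical detail at the edge: the gateway should still bound how long it holds the HTTP connection, so a lost worker or a dropped reply cannot pin the client forever. A minimal sketch with plain CompletableFuture.orTimeout (Java 9+); the error text and the helper name are illustrative, not part of the pattern above:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class BoundedWait {
    // Fails the pending reply with a TimeoutException if nothing arrives in time,
    // and converts any failure into a plain error string for the HTTP response.
    static CompletableFuture<String> withDeadline(CompletableFuture<String> reply, long millis) {
        return reply
                .orTimeout(millis, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> "Task failed or timed out: " + ex.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        // A reply that never arrives: the deadline fires instead of hanging the thread.
        CompletableFuture<String> lost = new CompletableFuture<>();
        System.out.println(withDeadline(lost, 100).join());

        // A reply that arrives in time passes through untouched.
        System.out.println(withDeadline(CompletableFuture.completedFuture("ok"), 100).join());
    }
}
```

In the controller shown later, the same idea slots in between sendAndReceive and thenApply; pick a deadline comfortably below whatever your load balancer or API gateway enforces.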
Introducing Sentinel: The Open-Source Starter
Implementing dynamic Service Bus Administration clients, processor lifecycles, and thread management is complex. To solve this, I built Sentinel—an open-source Spring Boot starter that abstracts this entire pattern into a single library dependency.
Here is how you can completely decouple your HTTP APIs from your slow AI workers in just a few lines of code.
1. Add the Dependency
<dependency>
<groupId>io.github.shivamsaluja</groupId>
<artifactId>sentinel-servicebus-starter</artifactId>
<version>1.0.0</version>
</dependency>
2. The Zero-Boilerplate Configuration
Sentinel handles all the Azure SDK heavy lifting. Just point it to your queues in application.yml:
sentinel:
servicebus:
connection-string: "Endpoint=sb://your-namespace.servicebus.windows.net/;SharedAccessKeyName=...;"
request-queue: "ai-task-requests"
reply-topic: "ai-task-replies"
3. The Gateway Controller (The Magic)
By returning a CompletableFuture, we instantly free up the Tomcat HTTP thread. The client's connection remains open, but the server resources are released, allowing massive concurrency.
import java.util.concurrent.CompletableFuture;

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/api/v1/ai")
public class GatewayController {

    private final SentinelTemplate sentinelTemplate;

    public GatewayController(SentinelTemplate sentinelTemplate) {
        this.sentinelTemplate = sentinelTemplate;
    }

    @PostMapping("/generate")
    public CompletableFuture<ResponseEntity<String>> generateReport(@RequestBody String prompt) {
        // 1. Send the prompt to the Service Bus and wait.
        //    Under the hood, Sentinel manages the dynamic SQL subscription.
        CompletableFuture<String> asyncReply = sentinelTemplate.sendAndReceive(prompt);

        // 2. Map the asynchronous reply back to a standard HTTP 200 OK.
        return asyncReply
                .thenApply(ResponseEntity::ok)
                .exceptionally(ex -> ResponseEntity.internalServerError()
                        .body("Task failed: " + ex.getMessage()));
    }
}
4. The Backend Worker Contract
Your backend workers remain standard, dumb, asynchronous consumers. They just need to respect the routing contract by passing the properties back.
private void processAIRequest(ServiceBusReceivedMessageContext context) {
    ServiceBusReceivedMessage request = context.getMessage();

    // Extract the routing property injected by the Sentinel Gateway
    String replyToInstance = (String) request.getApplicationProperties().get("ReplyToInstance");

    // ... (Simulate slow AI processing taking 45 seconds) ...
    String aiResponse = "Generated Report Data...";

    ServiceBusMessage replyMessage = new ServiceBusMessage(aiResponse);
    replyMessage.setCorrelationId(request.getCorrelationId());

    // CRITICAL: Attach the routing property so Azure knows which pod gets the reply
    if (replyToInstance != null) {
        replyMessage.getApplicationProperties().put("ReplyToInstance", replyToInstance);
    }

    // senderClient is a ServiceBusSenderClient bound to the global reply topic
    senderClient.sendMessage(replyMessage);
}
The Result
By dropping the Session requirement, your API Gateway layer becomes infinitely horizontally scalable. You can deploy 10 pods or 1,000 pods. The Azure Service Bus handles all the complex routing logic on the broker side, and your legacy clients get their synchronous 200 OK—no matter how long the AI takes to think.
If you are dealing with timeout issues or brittle edge-integration architectures, check out the project on GitHub.
🔗 Sentinel Service Bus Starter on GitHub
I would love to hear your thoughts, feedback, or horror stories about managing Service Bus sessions in the comments below!

