Sivagurunathan Velayutham

How I Built a Claude Router with Structured Concurrency and Virtual Threads

Introduction

Recently I read Netflix's blog post on Virtual Threads and how they improved their backend system performance. This led me to explore how Virtual Threads and StructuredTaskScope work internally. In this post, I'll explain Virtual Threads, then show how to use them in a practical project with benchmarks.

Background

Threads

Before diving into VThreads, let's take a step back and understand Threads.

One of the standard textbook definitions is:

Threads are lightweight processes running alongside your application process.

Let's break down the above statement.

The first part is "lightweight process", which means a thread has a smaller memory footprint than a full process. Each thread gets its own stack, and once the thread's lifetime ends, that memory is released. The second part is "alongside your application process": all threads are still managed by the main application process. There's a gotcha here, though: threads can leak if the process doesn't clean them up properly.
Internally, every thread is mapped to the scheduler inside the OS, which wakes each thread up to execute its task and return. The OS handles the heavy lifting of deciding how scheduling happens, e.g. round-robin, priority-based, and so on.

One of the main drawbacks of threads shows up in I/O-intensive applications: each thread blocks until its I/O response comes back, sitting idle the whole time, which wastes the thread resource. Think of a web server handling requests with a thread-per-request model (one request, one thread). Under high load, the number of parallel requests your system can handle is capped by the maximum number of threads the operating system supports.

Each platform thread consumes ~1MB of stack memory. With 200 threads, that's 200MB just for thread stacks. Under high load, request 201 must wait even though threads 1-200 are just sitting idle waiting for I/O responses.

// Traditional thread-per-request model
ExecutorService executor = Executors.newFixedThreadPool(200);

for (int i = 0; i < 1000; i++) {
    executor.submit(() -> {
        // This thread is BLOCKED during the entire HTTP call
        HttpResponse<String> response =
            httpClient.send(request, HttpResponse.BodyHandlers.ofString()); // ~100ms wait
        return process(response);
    });
}
// Requests 201-1000 must WAIT - all 200 threads blocked on I/O!

Virtual Threads

To address this thread resource contention, Project Loom brought Virtual Threads to JDK 21. Think of them as an abstraction: virtual threads are managed by the JVM and assigned to actual platform threads (the normal threads managed by the OS). Instead of being bound by OS constraints, the JVM maintains the virtual threads itself. With full control, the JVM can pause a virtual thread and resume it when its I/O operation completes (with either success or failure).

Now you'll have an important question: how does the JVM know when to pause and resume a virtual thread?
When a virtual thread blocks (called "parking" in the JDK code), it stays parked until it is unparked or interrupted. How does that work? The JVM suspends the thread's continuation (a snapshot of its stack) and frees the platform (OS) thread. When the operation completes, the virtual thread resumes from that exact snapshot.

The JDK's common blocking calls (like socket read/write) have been retrofitted to be virtual-thread-aware: when invoked on a virtual thread, they go through a non-blocking mechanism under the hood (epoll on Linux) instead of blocking the OS thread. The JVM stores the virtual thread's stack and variables, freeing the platform/OS thread to run other virtual threads. Once the virtual thread is unblocked, the JVM restores its stack and resumes execution. With this, the JVM can run many virtual threads on a limited set of platform threads.
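
You can see the unmount in action. Here's a minimal sketch of mine (JDK 21+): a virtual thread's toString() includes its current carrier thread, and the carrier may differ before and after a blocking call.

import java.time.Duration;

// The carrier shown in toString() may change across a blocking call,
// because the JVM unmounts the virtual thread while it is parked.
public class CarrierDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread vt = Thread.ofVirtual().start(() -> {
            System.out.println("before sleep: " + Thread.currentThread());
            try {
                Thread.sleep(Duration.ofMillis(100)); // parks: carrier thread is freed
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("after sleep:  " + Thread.currentThread());
        });
        vt.join();
    }
}

On a typical run, the output may show a different ForkJoinPool worker as the carrier after the sleep than before it.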

To move to virtual threads on JDK 21+, the change is a single line:

// Before: Platform threads (1:1 with OS) - each thread ~1MB
ExecutorService executor = Executors.newFixedThreadPool(200);

// After: Virtual threads (M:N with OS) - each virtual thread ~1KB
ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

The key insight: virtual threads don't block OS threads. A virtual thread waiting for I/O is just a Java object in the heap (~1KB), not a blocked OS thread (~1MB). The JVM unmounts the virtual thread from its carrier thread, freeing the carrier to run other virtual threads.
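
To make the numbers concrete, here's a minimal sketch of mine (JDK 21+) that runs 10,000 concurrent blocking tasks; with ~1MB platform threads, the stacks alone would need roughly 10GB:

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

// 10,000 concurrent blocking tasks running on a handful of carrier threads.
public class ScaleDemo {
    public static void main(String[] args) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 10_000).forEach(i -> executor.submit(() -> {
                Thread.sleep(Duration.ofSeconds(1)); // parks the virtual thread, not an OS thread
                return i;
            }));
        } // close() waits for all submitted tasks to finish
    }
}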

Structured Concurrency

StructuredTaskScope (still a preview API as of Java 25) enforces a simple rule: tasks cannot outlive their scope.

Traditional concurrency has fundamental issues:

  • Thread leaks: Tasks can outlive the method that created them
  • Manual cancellation: Must remember to cancel remaining tasks on partial failure
  • Complex cleanup: try/catch/finally blocks become unwieldy

StructuredTaskScope solves this with structured lifetime:

try (var scope = StructuredTaskScope.open(Joiner.awaitAllSuccessfulOrThrow())) {
    Subtask<User> userTask = scope.fork(() -> fetchUser(id));
    Subtask<Orders> ordersTask = scope.fork(() -> fetchOrders(id));

    scope.join(); // Wait for all
    return new Dashboard(userTask.get(), ordersTask.get());
} 

Built-in Joiner strategies:

  • awaitAllSuccessfulOrThrow() - Wait for all tasks to complete, fail if any fails
  • anySuccessfulResultOrThrow() - Return first successful result and cancel rest

Let's take a real-world example. With the rise in LLM usage, a common use case is choosing the right model for the right task.

Imagine building an intelligent LLM router that takes a prompt and routes to the best model. For finding the fastest model, we'll use a racing pattern: send requests to all models simultaneously, return whichever responds first, and cancel the rest. Each race result gets recorded by a metrics collector, tracking win rates and latency per model. Over time, the router learns which model consistently wins and starts routing directly to it, skipping unnecessary API calls.
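
As a sketch of that metrics idea (the class and method names here are hypothetical, not taken from the repo):

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical win-rate tracker: record each race winner, and report a
// dominant model once one wins >90% of at least 500 races.
class RaceMetrics {
    private final Map<String, LongAdder> wins = new ConcurrentHashMap<>();
    private final LongAdder totalRaces = new LongAdder();

    void recordWin(String model) {
        wins.computeIfAbsent(model, k -> new LongAdder()).increment();
        totalRaces.increment();
    }

    Optional<String> dominantModel() {
        long total = totalRaces.sum();
        if (total < 500) return Optional.empty();
        return wins.entrySet().stream()
                .filter(e -> e.getValue().sum() > 0.9 * total)
                .map(Map.Entry::getKey)
                .findFirst();
    }
}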

Claude offers three model tiers: Haiku (fastest, cheapest), Sonnet (balanced), and Opus (most capable, slowest).

For this project, I built a simple HTTP server using Javalin that exposes a /chat endpoint. When a request comes in, the router races all three models, returns the fastest response, and tracks metrics. The server runs on Java 25 with virtual threads enabled.
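
The wiring looks roughly like this. It's a sketch assuming Javalin 6's useVirtualThreads flag; ModelRouter is a stand-in name for the routing class, and LLMRequest is the request type used in the code below.

import io.javalin.Javalin;

// Sketch of the server setup; ModelRouter wraps the racing logic.
public class RouterServer {
    public static void main(String[] args) {
        var router = new ModelRouter();

        Javalin.create(config -> config.useVirtualThreads = true)
               .post("/chat", ctx ->
                       ctx.json(router.route(ctx.bodyAsClass(LLMRequest.class))))
               .start(7070);
    }
}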

Let's look at the core racing logic. Without StructuredTaskScope, the code is messy:

private LLMResponse raceModelsTraditional(List<Model> models, LLMRequest request) {
    ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();
    CompletableFuture<LLMResponse>[] futures = new CompletableFuture[models.size()];
    AtomicBoolean completed = new AtomicBoolean(false);

    try {
        // Create futures for each model
        for (int i = 0; i < models.size(); i++) {
            Model model = models.get(i);
            futures[i] = CompletableFuture.supplyAsync(() -> {
                if (completed.get()) {
                    throw new CancellationException("Race already won");
                }
                return executeModel(model, request);
            }, executor);
        }

        // Wait for the first future to complete - note that anyOf()
        // completes on the first *completion*, including failures,
        // which is yet another wart of this approach
        CompletableFuture<Object> anyOf = CompletableFuture.anyOf(futures);
        LLMResponse winner = (LLMResponse) anyOf.get(30, TimeUnit.SECONDS);
        completed.set(true);

        // Manually cancel remaining futures
        for (CompletableFuture<LLMResponse> future : futures) {
            if (!future.isDone()) {
                future.cancel(true);
            }
        }

        return winner;

    } catch (TimeoutException e) {
        // Cancel all on timeout
        for (CompletableFuture<LLMResponse> future : futures) {
            future.cancel(true);
        }
        return handleError(e);
    } catch (Exception e) {
        return handleError(e);
    } finally {
        executor.shutdown();
    }
}

With StructuredTaskScope you can simply change to:

private LLMResponse raceModels(List<Model> models, LLMRequest request) {
    try (var scope = StructuredTaskScope.open(
            Joiner.<LLMResponse>anySuccessfulResultOrThrow())) {

        // Fork concurrent tasks
        for (Model model : models) {
            scope.fork(() -> executeModel(model, request));
        }

        // Wait for first success - others auto-cancelled
        return scope.join();

    } catch (Exception e) {
        return handleError(e);
    }
}

No manual cancellation. No thread leaks. No forgotten cleanup.
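
One thing the traditional version had that this one drops is the 30-second timeout. The JDK 25 preview API lets you restore it through the scope's configuration function; here's a sketch using the same types as above:

// Same race, with the 30s timeout restored via scope configuration.
private LLMResponse raceModelsWithTimeout(List<Model> models, LLMRequest request) {
    try (var scope = StructuredTaskScope.open(
            Joiner.<LLMResponse>anySuccessfulResultOrThrow(),
            cf -> cf.withTimeout(Duration.ofSeconds(30)))) {

        for (Model model : models) {
            scope.fork(() -> executeModel(model, request));
        }
        return scope.join(); // cancels the scope and throws if the timeout expires

    } catch (Exception e) {
        return handleError(e);
    }
}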

With the router implemented, I wanted to see if virtual threads actually deliver on their promise. I ran the server under load and compared both approaches.

Benchmarks

Virtual Threads vs Platform Threads

10,000 requests, 1,000 concurrency

Metric          Platform      Virtual      Improvement
Throughput      1,530 req/s   3,078 req/s  2x
P50 Latency     475ms         103ms        4.6x
P95 Latency     1,276ms       420ms        3x

Racing Router Results

Metric          Value
P50 Latency     96ms
HAIKU Win Rate  96%
Cost Savings    95.9%

The router automatically identified HAIKU as fastest and transitioned to single-model mode after 500 races.

For detailed benchmarks, see the GitHub repo.

Conclusion

Java 25's structured concurrency changes how we write concurrent code:

  • Virtual Threads: One-line change, 2x throughput
  • StructuredTaskScope: Safe task lifecycle, automatic cancellation
  • Racing pattern: Complex manual code becomes simple with cleanup built-in

A word of caution: virtual threads aren't a silver bullet. Watch out for pinning, where a virtual thread gets stuck on its carrier thread and can't unmount during a blocking operation. On JDK 21-23 this happened when blocking inside a synchronized block; since JDK 24 (JEP 491), synchronized no longer pins, but blocking inside native code called via JNI still does. When pinned, the virtual thread behaves like a platform thread, losing its scalability benefits. If you're on an older JDK, prefer ReentrantLock over synchronized and monitor for pinning with -Djdk.tracePinnedThreads=short; on newer JDKs, watch JFR's jdk.VirtualThreadPinned event.

Source: github.com/SivagurunathanV/claude-router
