DEV Community

Hagicode

Posted on • Originally published at docs.hagicode.com

Solving Backend Distributed Challenges in AI Programming Workbenches with Orleans


Managing dozens of AI CLI tools in a single process while pulling real-time streams across dozens of sessions—sounds like a pipe dream? Honestly, we thought it was pretty ridiculous too. But Orleans's Virtual Actor model actually tames this complexity quite beautifully. You know how some tools are born to solve specific problems? You just don't realize how perfect they are until you actually encounter that problem.

Background

Building an AI programming workbench comes with a particular architectural quirk: each user session is essentially a living, stateful "lifeform" that may stay engaged with you for an hour or two. When a user sends a message, the system has to pick an appropriate AI provider (Claude Code, Codex, Gemini, Kimi, CodeBuddy, and so on; just listing the names takes fingers and toes), spin up a subprocess, stream execution results back in real time, and synchronize state changes over SignalR.

If you tried to tackle this with a traditional stateless HTTP + Redis approach, you'd face some headaches:

  1. Multi-provider management ends up scattered everywhere. Each AI CLI tool has its own process model, streaming output format, and timeout temperament. Mix a dozen of these together and the code quickly turns into spaghetti: not inedible, but guaranteed indigestion.

  2. Timeouts are hard to control; it becomes pure luck. An AI operation might finish in three minutes or drag on for two hours. A single global timeout forces a bad trade-off: set it short and long-running operations get cut off mid-task, which is unfair to users; set it long and stuck operations keep tying up the thread pool, which isn't a pretty picture either.

  3. Concurrency needs careful budgeting, because GPUs don't fall from the sky. Run too many AI operations at once and the machine's resources max out immediately; be too conservative and you waste compute you paid for, like cranking the AC down to 16°C and then piling on quilts. You need precise, global control over how many sessions are active at the same time.

  4. State management is complex enough to make you question your life choices. Each session has its own message queue, phase state, and bound executor: this is stateful data. Forcing it into a stateless HTTP model means using Redis as universal glue. It sticks, sure, but then you discover you've written a mountain of serialization/deserialization and distributed-lock logic. When you're done, you stare at the screen and wonder: am I solving business problems, or fighting infrastructure?

These problems together aren't so much a technical challenge as a soul-searching interrogation of your architecture choices.

About HagiCode

These ideas didn't come out of thin air. The solution in this article comes from real-world battle scars in the HagiCode project. HagiCode is a desktop workbench for AI-assisted collaborative programming. Its backend coordinates dozens of AI CLI tools in a single process while giving the frontend low-latency, real-time responses. In other words, we want the horse to run without eating any grass, and to sing while it runs.

The Orleans architecture discussed below is exactly what we refined during HagiCode's development. If you find this approach interesting, HagiCode itself may be worth a closer look.

Selection: Why Orleans

Facing that soul-searching interrogation above, we seriously considered three paths:

Approach A: Stateless API + Redis state management. The logic is simple enough—pull session state from Redis per request, execute, write back. Horizontal scaling feels comfortable, but the Redis state structure grows alongside the business until you're not sure if you're maintaining a cache or an implicit database. State consistency relies on locks, streaming communication requires additional WebSocket/SSE routing layers. Essentially, Redis is just a shared dictionary here—it can't provide the stateful abstraction you really need.

Approach B: Actor model frameworks (Dapr / Akka.NET). Dapr's Actor capability is sufficient, but it requires deploying a Sidecar—for local desktop products, that's not just overkill, it's like driving a tank to buy groceries. Akka.NET's Actor model leans more toward low-latency short tasks. For long-lifecycle workflows lasting an hour or two, you have to handle persistence and recovery yourself—the framework doesn't provide a safety net.

Approach C: Microsoft Orleans. Seeing Orleans's Virtual Actor model felt like searching everywhere for your keys and then finding them in your pocket. Several of its features look tailor-made for our scenario:

  • Automatic activation/deactivation management: You don't worry about when grains are born or die; the runtime handles it. One session maps to one grain: while the session is in use the grain stays active, and once it goes idle the runtime deactivates and recycles it. Only people who have managed lifecycles by hand truly appreciate not having to care.

  • Native streaming support with IAsyncEnumerable<T>: From CLI process output to frontend display, fully async streaming throughout, no intermediate buffer queues needed. Just this one feature saved us at least a thousand lines of hand-written glue code.

  • [AlwaysInterleave] and [ResponseTimeout]: Fine-grained concurrency and timeout control, configured per method on the grain interface instead of one global setting for everything. No more painful choice between "everything short" and "everything long."

  • Built-in persistent state (IPersistentState<T>): State auto-persists, no need for extra distributed cache setup. Peace of mind, seriously.

In evaluation, Orleans almost perfectly checked all boxes for HagiCode backend's core needs:

| Capability | Orleans solution |
|---|---|
| Stateful sessions | IPersistentState<T> + sharded SQLite persistence |
| Streaming output | Native IAsyncEnumerable<T> support, flowing straight through to SignalR |
| Long timeout control | [ResponseTimeout("02:00:00")] configured per interface method |
| Provider polymorphic routing | ExecutorGrainFactory dispatches based on AIProviderType |
| Concurrency control | SessionConcurrencyManager plus grain single-threaded scheduling |

Five Core Design Decisions

Choosing the right tool is only step one; implementing it well is where the real skill shows. Below are five key designs we distilled after falling into pits, climbing out, and brushing ourselves off. Some are hard-won experience, some are lessons learned; read on and judge for yourself.

1. Facade Grain Pattern

The system's core scheduling grain is SessionGrain. It doesn't handle all the logic directly, though; if it did, it would grow into a god class tens of thousands of lines long. God classes make you feel omnipotent while writing them and powerless when changing them.

We delegate domain-specific logic to two runtime components: ChatSessionGrain handles chat mode, ProposalSessionGrain handles proposal mode.

internal partial class SessionGrain(
    ILogger<SessionGrain> logger,
    IServiceProvider serviceProvider,
    IExecutorGrainFactory executorGrainFactory,
    IMessageService messageService,
    [PersistentState("session")] IPersistentState<SessionState> state)
    : Grain, ISessionGrain
{
    // Backing fields for the lazily created runtime components.
    private ChatSessionGrain? _chatSessionComponent;
    private ProposalSessionGrain? _proposalSessionComponent;

    // Components are created on first use and keep no state of their own;
    // both work against the grain's shared, persisted SessionState.
    internal ChatSessionGrain ChatSessionComponent =>
        _chatSessionComponent ??= new ChatSessionGrain(RuntimeContext);

    internal ProposalSessionGrain ProposalSessionComponent =>
        _proposalSessionComponent ??= new ProposalSessionGrain(RuntimeContext);

    // Route work to the component that matches the session type.
    internal ISessionRuntimeComponent GetRuntimeComponent(SessionType sessionType) =>
        sessionType switch
        {
            SessionType.Chat => ChatSessionComponent,
            SessionType.Proposal => ProposalSessionComponent,
            _ => throw new ArgumentOutOfRangeException(nameof(sessionType))
        };
}

This pattern keeps things clean: the grain identity is stable and doesn't change with session type; external callers only deal with ISessionGrain and don't care how work is split internally; the components themselves are stateless and can be rebuilt on demand; and because both share the same SessionState persistence, data consistency comes for free. Who says architectural design can't be elegant?
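To make the facade idea concrete outside Orleans, here is a minimal plain C# sketch. It assumes hypothetical types (SessionFacade, ChatComponent, ProposalComponent, and a SessionState with MessageCount and LastHandledBy fields) that are not HagiCode's. One facade object keeps a stable identity and a single shared state object, while lazily created, field-free components do the type-specific work against that state.

```csharp
#nullable enable
using System;

// Hypothetical session kinds; illustrative only.
public enum SessionType { Chat, Proposal }

// Shared state owned by the facade (stands in for IPersistentState<SessionState>).
public sealed class SessionState
{
    public int MessageCount { get; set; }
    public string LastHandledBy { get; set; } = "";
}

public interface ISessionRuntimeComponent
{
    string Handle(SessionState state, string message);
}

// Components have no fields: everything they touch lives in SessionState.
public sealed class ChatComponent : ISessionRuntimeComponent
{
    public string Handle(SessionState state, string message)
    {
        state.MessageCount++;
        state.LastHandledBy = "chat";
        return $"chat:{message}";
    }
}

public sealed class ProposalComponent : ISessionRuntimeComponent
{
    public string Handle(SessionState state, string message)
    {
        state.MessageCount++;
        state.LastHandledBy = "proposal";
        return $"proposal:{message}";
    }
}

// The facade: one identity, one state object, components created on first use.
public sealed class SessionFacade
{
    private ChatComponent? _chat;
    private ProposalComponent? _proposal;

    public SessionState State { get; } = new();

    private ISessionRuntimeComponent GetComponent(SessionType type) => type switch
    {
        SessionType.Chat => _chat ??= new ChatComponent(),
        SessionType.Proposal => _proposal ??= new ProposalComponent(),
        _ => throw new ArgumentOutOfRangeException(nameof(type))
    };

    // Callers only see the facade; routing to a component is an internal detail.
    public string Send(SessionType type, string message) =>
        GetComponent(type).Handle(State, message);
}
```

Because the components carry no fields, discarding one and recreating it loses nothing, which is what makes the lazy `??=` construction safe.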

2. Polymorphic Executor Factory

HagiCode supports a dozen AI CLI tools, each requiring independent process management and streaming output. We implemented a dedicated grain for each tool—ClaudeCodeGrain, CodexGrain, GeminiGrain, etc., listing them like roll call. Then we rely on a factory for unified routing:

internal sealed class ExecutorGrainFactory(IGrainFactory grainFactory) : IExecutorGrainFactory
{
    private readonly IGrainFactory _grainFactory = grainFactory;

    // Resolve the provider-specific grain for this session and wrap it in the
    // shared IExecutorStreamGrain interface, so callers never see the concrete type.
    public IExecutorStreamGrain GetExecutorGrain(
        AIProviderType executorType, SessionId sessionId)
    {
        return executorType switch
        {
            AIProviderType.ClaudeCodeCli => ExecutorStreamGrainAdapter.From(
                _grainFactory.GetGrain<IClaudeCodeGrain>(sessionId.Value)),
            AIProviderType.CodexCli => ExecutorStreamGrainAdapter.From(
                _grainFactory.GetGrain<ICodexGrain>(sessionId.Value)),
            AIProviderType.GeminiCli => ExecutorStreamGrainAdapter.From(
                _grainFactory.GetGrain<IGeminiGrain>(sessionId.Value)),
            // ... 10+ providers
            _ => throw new NotSupportedException(
                $"Unsupported executor type: {executorType}")
        };
    }
}

All executor grains implement the same IExecutorStreamGrain interface, adapted uniformly through ExecutorStreamGrainAdapter. Upper-layer code never knows which provider is running underneath. Adding a new tool means adding a new grain class and one line to the factory switch, and that's it. This extension point is like leaving a door open for your future self, with no maze behind it: you just walk straight in.
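The factory-plus-adapter shape can be sketched without Orleans. In this minimal example, hypothetical executors (ClaudeLikeExecutor, CodexLikeExecutor) expose deliberately different native methods, ExecutorAdapter.From overloads wrap each one in a common IExecutor interface, and the factory switch is the only place that knows about concrete providers. It illustrates the pattern only; it is not the ExecutorStreamGrainAdapter implementation.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical provider kinds; Gemini is intentionally left unregistered below.
public enum ProviderKind { ClaudeCode, Codex, Gemini }

// The common interface upper layers program against.
public interface IExecutor
{
    IEnumerable<string> Execute(string command);
}

// Two "native" executors whose method names and return shapes differ.
public sealed class ClaudeLikeExecutor
{
    public IEnumerable<string> RunClaude(string command)
    {
        yield return $"[claude] {command}";
    }
}

public sealed class CodexLikeExecutor
{
    public string[] RunCodex(string command) => new[] { $"[codex] {command}", "[codex] done" };
}

// Adapter: captures the native call behind the shared IExecutor interface.
public sealed class ExecutorAdapter : IExecutor
{
    private readonly Func<string, IEnumerable<string>> _run;
    private ExecutorAdapter(Func<string, IEnumerable<string>> run) => _run = run;

    public static IExecutor From(ClaudeLikeExecutor executor) => new ExecutorAdapter(executor.RunClaude);
    public static IExecutor From(CodexLikeExecutor executor) => new ExecutorAdapter(executor.RunCodex);

    public IEnumerable<string> Execute(string command) => _run(command);
}

// The single routing point: one switch arm per supported provider.
public static class ExecutorFactory
{
    public static IExecutor Get(ProviderKind kind) => kind switch
    {
        ProviderKind.ClaudeCode => ExecutorAdapter.From(new ClaudeLikeExecutor()),
        ProviderKind.Codex => ExecutorAdapter.From(new CodexLikeExecutor()),
        _ => throw new NotSupportedException($"Unsupported executor type: {kind}")
    };
}
```

Unmapped values (Gemini here) fall through to the NotSupportedException arm, so a missing registration fails loudly instead of silently picking a default.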

3. Streaming Communication Pipeline

Orleans's native support for IAsyncEnumerable<T> makes streaming output particularly natural. Taking ClaudeCodeGrain as an example:

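using System.Runtime.CompilerServices; // for [EnumeratorCancellation]

public async IAsyncEnumerable<ClaudeCodeResponse> ExecuteCommandStreamAsync(
    string command,
    string? heroId,
    [EnumeratorCancellation] CancellationToken token = default)
{
    // Resolve the provider (and its configuration) for the selected hero.
    var (provider, configuration) = await CreateProviderAsync(heroId, token);

    // `context` is built elsewhere in the grain; the excerpt does not show where.
    await foreach (var response in SendAsync(command, provider, context, token))
    {
        // Re-yield each chunk as soon as it arrives: no intermediate buffer.
        yield return response;
    }
}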

The full pipeline looks like this: CLI process stdout → grain streaming yield → ExecutorGrainFactory wraps each chunk as a SessionMessage → SessionGrain pushes it to the frontend via SignalR. Every step is asynchronous streaming, with no intermediate buffering and no synchronous blocking. This is one of Orleans's biggest advantages over traditional approaches: you don't need to maintain a ConcurrentQueue inside the grain and push from it manually, because a single yield return handles everything. Once you've experienced this fluency, there's no going back.
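The stdout → executor → session → frontend chain can be modeled with plain IAsyncEnumerable<T> stages. This is a sketch under stated assumptions: the "stdout" is an in-memory sequence, SessionMessageSketch is a hypothetical record, and the `push` delegate stands in for the SignalR call. Each stage awaits and re-yields items one at a time, so nothing is buffered between stages.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical message type standing in for the article's SessionMessage.
public sealed record SessionMessageSketch(string SessionId, int Sequence, string Text);

public static class StreamingPipelineSketch
{
    // Stage 1: stands in for reading a CLI process's stdout line by line.
    public static async IAsyncEnumerable<string> ReadStdoutAsync(
        IEnumerable<string> fakeStdout,
        [EnumeratorCancellation] CancellationToken token = default)
    {
        foreach (var line in fakeStdout)
        {
            token.ThrowIfCancellationRequested();
            await Task.Yield(); // simulate each line arriving asynchronously
            yield return line;
        }
    }

    // Stage 2: the executor re-yields each chunk the moment it arrives.
    public static async IAsyncEnumerable<string> ExecuteAsync(
        IEnumerable<string> fakeStdout,
        [EnumeratorCancellation] CancellationToken token = default)
    {
        await foreach (var chunk in ReadStdoutAsync(fakeStdout, token))
        {
            yield return chunk;
        }
    }

    // Stage 3: the session wraps chunks as messages and pushes them to a sink
    // (a delegate here; a SignalR push in the real pipeline).
    public static async Task PumpAsync(
        string sessionId,
        IEnumerable<string> fakeStdout,
        Func<SessionMessageSketch, Task> push,
        CancellationToken token = default)
    {
        var sequence = 0;
        await foreach (var chunk in ExecuteAsync(fakeStdout, token))
        {
            await push(new SessionMessageSketch(sessionId, sequence++, chunk));
        }
    }
}
```

Cancellation flows through the same token at every stage, so cancelling the session stops the read loop without any extra plumbing.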

4. Layered Timeout Strategy

AI operations vary enormously in duration: a simple syntax fix might take 3 seconds, while a complex refactoring could run for two hours. With a one-size-fits-all timeout, whatever value you pick, some operation lands on the wrong side of it.

We configure timeouts in layers: the silo-wide default is 30 seconds, and individual interface methods override it with [ResponseTimeout]:

public static class GrainTimeouts
{
    public const string LongRunningResponseTimeout = "02:00:00";
    public const string HealthCheckResponseTimeout = "00:01:00";
}

[Alias("HagiCode.Orleans.IAIGrain")]
public interface IAIGrain : IGrainWithStringKey
{
    [ResponseTimeout(GrainTimeouts.LongRunningResponseTimeout)]
    Task<ProposalOptimizationBundleResultDto> OptimizeProposalBundleAsync(...);

    [ResponseTimeout(GrainTimeouts.HealthCheckResponseTimeout)]
    Task<HealthCheckResult> PingAsync(HealthCheckRequest? request = null);
}

The principle is simple: conservative by default, relaxed as needed. This isn't some deep theory—it's just applying the principle of least privilege to timeout configuration. AI operations get two hours, health checks get one minute, each lives their own life, no one delays anyone.
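How a per-method override layers on top of a default can be shown with reflection. This sketch uses a hypothetical TimeoutHintAttribute that only mirrors the "hh:mm:ss" string shape of [ResponseTimeout]; it does not reproduce how Orleans reads its own attribute. The 30-second default matches the silo-level value described above.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Reflection;
using System.Threading.Tasks;

// Hypothetical attribute mirroring the shape of [ResponseTimeout("hh:mm:ss")].
[AttributeUsage(AttributeTargets.Method)]
public sealed class TimeoutHintAttribute : Attribute
{
    public TimeoutHintAttribute(string timeout) =>
        Timeout = TimeSpan.Parse(timeout, CultureInfo.InvariantCulture);

    public TimeSpan Timeout { get; }
}

// Hypothetical interface: two overridden methods and one that uses the default.
public interface IAiOperationsSketch
{
    [TimeoutHint("02:00:00")] Task OptimizeAsync();
    [TimeoutHint("00:01:00")] Task PingAsync();
    Task GetStatusAsync();
}

public static class TimeoutResolver
{
    // Layer 1: the conservative default for everything.
    public static readonly TimeSpan Default = TimeSpan.FromSeconds(30);

    // Layer 2: a per-method override wins when present.
    public static TimeSpan Resolve(MethodInfo method) =>
        method.GetCustomAttribute<TimeoutHintAttribute>()?.Timeout ?? Default;
}
```

Keeping the override next to the method signature means the timeout is reviewed together with the operation it governs, rather than living in a distant config file.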

5. Batch Grain Collection Configuration

By default, Orleans automatically recycles (deactivates) grains after they're idle for a while. This is good in itself, but frequent activation/recycling is like repeatedly opening and closing the fridge door—just adding overhead. We configured longer collection times uniformly for core grain types:

internal static void ConfigureGrainCollectionOptions(
    GrainCollectionOptions options,
    OrleansTimeoutPolicy? timeoutPolicy = null)
{
    var coreGrainTypes = new[]
    {
        typeof(SessionGrain).FullName,
        typeof(ClaudeCodeGrain).FullName,
        typeof(CodexGrain).FullName,
        typeof(GameDriverGrain).FullName,
        // ... a dozen or so more core grain types
    };

    var collectionAge = timeoutPolicy?.GrainCollectionAge
        ?? TimeSpan.FromHours(24);

    foreach (var name in coreGrainTypes)
    {
        options.ClassSpecificCollectionAge[name!] = collectionAge;
    }

    // MessageBucket exception: 10 minute fast recycling
    options.ClassSpecificCollectionAge[typeof(MessageBucketGrain).FullName!] =
        TimeSpan.FromMinutes(10);
}

The core idea is differentiation: high-frequency short-lived grains recycle quickly to release memory, core business grains keep hot caches with minimal churn. This optimization looks simple, but if you don't set it, default collection strategy has visible impact on throughput—anyone who's wrestled with this knows what I mean.
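The lookup rule behind ClassSpecificCollectionAge (a class-specific age if one is configured, otherwise the general collection age) can be modeled as a small dictionary wrapper. The CollectionAgePolicy type and the grain marker classes are hypothetical, and the 15-minute default used with it below is an arbitrary illustrative value, not a claim about Orleans's default.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Plain-C# model of per-type collection ages; not the Orleans API itself.
public sealed class CollectionAgePolicy
{
    private readonly Dictionary<string, TimeSpan> _perType = new();

    public CollectionAgePolicy(TimeSpan defaultAge) => DefaultAge = defaultAge;

    public TimeSpan DefaultAge { get; }

    // Keyed by full type name, the same key style the article's config uses.
    public CollectionAgePolicy Set(Type grainType, TimeSpan age)
    {
        _perType[grainType.FullName!] = age;
        return this;
    }

    // A class-specific entry wins; everything else gets the default.
    public TimeSpan AgeFor(Type grainType) =>
        _perType.TryGetValue(grainType.FullName!, out var age) ? age : DefaultAge;
}

// Hypothetical grain classes used only as keys.
public sealed class CoreSessionGrainSketch { }
public sealed class MessageBucketGrainSketch { }
public sealed class MiscGrainSketch { }
```

For example, registering CoreSessionGrainSketch at 24 hours and MessageBucketGrainSketch at 10 minutes leaves MiscGrainSketch on the default, which is the differentiation described above.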

Implementation Practice

Local Development and Persistence

For local development, HagiCode uses Orleans development clustering, with persistence through a sharded SQLite storage provider; this setup has been validated across multiple contributors' environments:

context.Services.AddOrleans(siloBuilder =>
{
    siloBuilder.UseDevelopmentClustering(options =>
    {
        options.PrimarySiloEndpoint = new IPEndPoint(
            IPAddress.Loopback, siloPort);
    });

    siloBuilder
        .Configure<ClusterOptions>(options =>
        {
            options.ClusterId = "hagicode-cluster";
            options.ServiceId = "hagicode-service";
        })
        .AddActivityPropagation();

    siloBuilder.ConfigureServices(services =>
    {
        services.AddSqliteGrainStorage(
            ProviderConstants.DEFAULT_STORAGE_PROVIDER_NAME,
            options =>
            {
                options.ShardRootPath = storageOptions.ShardRootPath;
                options.ShardCount = storageOptions.ShardCount;
                options.UseWalMode = storageOptions.UseWalMode;
            });
    });
});

The custom SqliteGrainStorage provider creates one database file per shard, at paths like data/orleans/grains/shard_00.db. In production you can switch to Azure Table Storage or SQL Server by changing only the storage provider registration, without touching grain code; that is the payoff of Orleans's storage provider abstraction. Good abstractions let you change backends like changing clothes; bad ones make it as painful as shedding your skin.
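The article does not say how grains are assigned to shard files, so the following is only a hypothetical sketch of one reasonable scheme: hash the grain key with a stable hash (32-bit FNV-1a here) and take it modulo the shard count. A stable hash matters because string.GetHashCode() is randomized per process in .NET Core, which would route the same grain to a different file after a restart.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical key-to-shard mapping; not taken from HagiCode's SqliteGrainStorage.
public static class ShardSelector
{
    public static int ShardIndex(string grainKey, int shardCount)
    {
        if (shardCount <= 0) throw new ArgumentOutOfRangeException(nameof(shardCount));

        uint hash = 2166136261; // FNV-1a 32-bit offset basis
        foreach (var b in Encoding.UTF8.GetBytes(grainKey))
        {
            hash ^= b;
            hash *= 16777619;   // FNV-1a 32-bit prime (wraps around, unchecked)
        }
        return (int)(hash % (uint)shardCount);
    }

    // Produces file names in the shard_00.db style mentioned above.
    public static string ShardPath(string root, string grainKey, int shardCount) =>
        Path.Combine(root, $"shard_{ShardIndex(grainKey, shardCount):D2}.db");
}
```

Under plain modulo hashing, changing the shard count remaps most keys, so in this scheme the count would need to stay fixed once data exists (or be migrated deliberately).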

Concurrent Session Control

SessionConcurrencyManager uses an in-process lock plus a global set of active session IDs to enforce a limit on concurrently active sessions:

internal static class SessionConcurrencyManager
{
    private static readonly HashSet<SessionId> GlobalActiveSessions = [];
    private static readonly Lock Lock = new(); // System.Threading.Lock (.NET 9+)

    // Upper bound on active sessions; loaded from configuration elsewhere (not shown).
    private static int _cachedMaxConcurrentSessions;

    internal static ConcurrencyCheckResult TryActivateSession(SessionId sessionId)
    {
        lock (Lock)
        {
            // Already active: re-entry does not consume another slot.
            if (GlobalActiveSessions.Contains(sessionId))
                return new ConcurrencyCheckResult { Allowed = true };

            // Limit reached: reject the new session.
            if (GlobalActiveSessions.Count >= _cachedMaxConcurrentSessions)
                return new ConcurrencyCheckResult { Allowed = false };

            // Claim a slot.
            GlobalActiveSessions.Add(sessionId);
            return new ConcurrencyCheckResult { Allowed = true };
        }
    }
}

This manager uses stack-trace-based caller verification to ensure it is only called from inside SessionGrain, so external code cannot bypass the concurrency check. To be honest, using internal static state here breaks the Actor isolation principle, and a static set only enforces the limit within a single silo process (which matches HagiCode's single-process desktop deployment). But concurrency control is inherently a global concern, so after weighing the trade-offs we accepted this compromise. Perfect is the enemy of good, in architecture as much as anywhere.
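The static manager above omits the release path. Here is a minimal instance-based variant, assuming a hypothetical SessionSlots class with string session IDs; it keeps the same idempotent-activation rule and adds Release so a finished session frees its slot. Making it an instance rather than static is a choice for testability in this sketch, not a description of HagiCode's code.

```csharp
using System.Collections.Generic;

// Hypothetical slot manager; names and shape are illustrative only.
public sealed class SessionSlots
{
    private readonly HashSet<string> _active = new();
    private readonly object _gate = new();
    private readonly int _max;

    public SessionSlots(int maxConcurrentSessions) => _max = maxConcurrentSessions;

    // Idempotent: re-activating an already-active session never takes a new slot.
    public bool TryActivate(string sessionId)
    {
        lock (_gate)
        {
            if (_active.Contains(sessionId)) return true;
            if (_active.Count >= _max) return false;
            _active.Add(sessionId);
            return true;
        }
    }

    // Call when a session finishes (for example, from a deactivation hook).
    public void Release(string sessionId)
    {
        lock (_gate) _active.Remove(sessionId);
    }

    public int ActiveCount
    {
        get { lock (_gate) return _active.Count; }
    }
}
```

If the release call is ever skipped, the slot leaks until restart, so it belongs in a code path that runs on every way a session can end.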

Health Check Integration

AIGrain.PingAsync() has two modes: lightweight connectivity probing and explicit Ping-Pong verification. The latter is used in the setup wizard to verify if a Provider is actually usable:

public async Task<HealthCheckResult> PingAsync(
    HealthCheckRequest? request = null)
{
    // isModelAware and timeoutCts are set up earlier in the method (elided in the excerpt).
    if (!isModelAware)
    {
        // Lightweight CLI readiness probe
        var provider = await aiProviderFactory.GetProviderAsync(
            AIProviderType.ClaudeCodeCli);
        var result = await provider.PingAsync(timeoutCts.Token);
        return new HealthCheckResult { IsHealthy = result.Success };
    }

    // Explicit Ping-Pong verification: a tiny, low-variance request to the model
    var response = await aiService.ExecuteAsync(new AIRequest
    {
        Prompt = HealthCheckPingPongProbe.Prompt,
        SystemMessage = HealthCheckPingPongProbe.SystemMessage,
        Temperature = 0,
        MaxTokens = 32
    }, timeoutCts.Token);

    // normalizedResponse is derived from `response` (normalization elided in the excerpt).
    var passed = HealthCheckPingPongProbe.IsExpectedResponse(
        normalizedResponse);
    return new HealthCheckResult { IsHealthy = passed };
}

Setting Temperature to 0 and capping MaxTokens at 32 makes the response as deterministic as possible while keeping costs down. Health checks aren't benchmarks; good enough is good enough. Same with people: knowing when to stop is rarer than knowing when to act.
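The excerpt does not show what HealthCheckPingPongProbe expects, so this sketch assumes the probe asks the model to reply with the single word PONG; the class and constant names are hypothetical. Even at temperature 0, models may add whitespace, quotes, or a trailing period, so the check normalizes the reply before a case-insensitive comparison instead of demanding a byte-exact match.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for HealthCheckPingPongProbe; the expected reply is an assumption.
public static class PingPongProbeSketch
{
    public const string Prompt = "Reply with exactly the word PONG.";
    public const string Expected = "PONG";

    // Strip surrounding whitespace, then common wrapping punctuation, then whitespace again.
    public static string Normalize(string? raw) =>
        (raw ?? string.Empty).Trim().Trim('"', '\'', '`', '.', '!').Trim();

    public static bool IsExpectedResponse(string? raw) =>
        string.Equals(Normalize(raw), Expected, StringComparison.OrdinalIgnoreCase);
}
```

A reply like `  pong.` passes, while an apology or refusal fails, which is exactly the signal the setup wizard needs.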

Conclusion

Looking back at HagiCode's journey building backend systems with Orleans, five core design decisions are worth remembering:

  1. Configure timeouts at interface granularity, don't use global unified timeouts—AI operations 2h, health checks 1min, default 30s, each manages their own, no crossing paths.

  2. Differentiate Grain Collection ages—high-frequency short-lived grains recycle quickly, core business grains keep hot caches, be fast where needed, be stable where needed.

  3. Streaming pipeline should be async throughout—from CLI stdout to SignalR push, don't introduce any synchronous blocking middleware, let it flow naturally like water.

  4. Facade Grain splits complexity—components are stateless but share persistent state, much more maintainable than god classes. Divide and conquer—ancestor wisdom applies equally well to code.

  5. Mark stable names on Grain interfaces with [Alias]—the last line of defense for serialization compatibility. Guard this line, and the probability of being woken up by alerts in the middle of the night drops significantly.

Orleans's Virtual Actor model provides a runtime abstraction for stateful, long-lifecycle session systems that is complete almost to the point of being touching. If you're building a similar AI workbench or real-time collaboration system, this approach is worth trying: not because it's perfect, but because in the right scenarios it fits just right.

This could have become a memory to cherish, though at the time I was simply bewildered... but I digress. Anyway, the code runs and the article is written. That's it.

Summary

For a topic like "Solving Backend Distributed Challenges in AI Programming Workbenches with Orleans," a prudent approach is to get the key configuration, dependency boundaries, and implementation path working first, then fill in the optimization details.

When goals, steps, and acceptance criteria are clear, such solutions typically transition more smoothly into actual delivery.


Thanks for reading. If this article helped, consider liking, bookmarking, or sharing it.
This article was created with AI assistance and reviewed by the author before publication.
