DEV Community

Rafael Andrade


Trupe: Implementing Supervisor

In my saga to implement an actor system, I have so far covered the basic usage of actors. Now I'm going to cover supervisors.

Supervisor

In case you don't know, in the Actor Model a supervisor is a fundamental piece of fault tolerance. A supervisor monitors its children (actors) and reacts when one of them exhibits bad behaviour.

A supervisor is a special type of actor whose only job is to monitor its children and take action when one of them exhibits bad behaviour.

Bad behaviour

In the Actor Model, bad behaviour means an actor throws an exception while processing a message.

When an actor throws an exception during message processing, we have two main strategies:

  • One For One: restarts only the failed child.
  • All For One: restarts all children if one of them fails.
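The difference between the two strategies comes down to which children are selected for a restart. Here is a minimal sketch of that selection, using plain strings as hypothetical child names instead of real actors. `RestartTargets` is an illustrative helper, not part of Trupe, and the enum is redeclared so the snippet stands alone.

```csharp
using System;
using System.Collections.Generic;

// Redeclared here only so the sketch compiles on its own.
public enum Strategy { OneForOne, AllForOne }

// Hypothetical helper: picks which children to restart when one of them fails.
public static class RestartTargets
{
    public static IReadOnlyList<string> Select(
        Strategy strategy, string failed, IReadOnlyList<string> children) =>
        strategy switch
        {
            // Only the child that threw is restarted.
            Strategy.OneForOne => new[] { failed },
            // Every sibling is restarted together with the failed child.
            Strategy.AllForOne => children,
            _ => throw new ArgumentOutOfRangeException(nameof(strategy)),
        };
}
```

With children `a`, `b`, `c` and a failure in `b`, One For One selects only `b`, while All For One selects all three.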

What should happen when a restart doesn't help and the child keeps failing? You can stop the child or escalate the failure to the parent supervisor.

Restart

Another question is what a restart means in the Actor Model. In the Actor Model and .NET, it means killing/disposing that actor instance and the mailbox, but keeping the IActorReference. If the actor is a supervisor, all of its children are stopped and killed, which makes the IActorReference of each of those children invalid.
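To make the "new instance, same reference" idea concrete, here is a minimal sketch. It is not Trupe's code: `ActorSlot` and `Counter` are hypothetical names, and the mailbox is left out. The only point is that the handle callers hold outlives the instance behind it.

```csharp
using System;

// Hypothetical stand-in for an actor: it only holds some in-memory state.
public sealed class Counter
{
    public int Value { get; set; }
}

// Hypothetical stand-in for a reference: callers keep this object, never the actor itself.
public sealed class ActorSlot
{
    private Counter _instance = new();

    // Identity of the reference; it does not change across restarts.
    public Guid Id { get; } = Guid.NewGuid();

    // Forwards the "message" to whichever instance is currently alive.
    public int Increment() => ++_instance.Value;

    // "Restart": discard the failed instance (and its state) and create a fresh one.
    public void Restart() => _instance = new Counter();
}
```

After `Restart()`, the `Id` is unchanged, but the next `Increment()` returns 1 again because the old instance's state is gone.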

Types of supervisor

Before diving into the code, here are the supervisors I'll implement:

  • Supervisor: a preemptive supervisor that doesn't allow adding more children after startup
  • DynamicSupervisor: a dynamic supervisor that allows creating children after startup, but supports only the One For One strategy
  • PartitionSupervisor: a partition-based supervisor that creates a fixed pool of identical workers and routes messages using hash-based partitioning

Implementation

Foundation

Before implementing the supervisors, we need some building blocks: the supervision Strategy, the FailureAction, and the RestartPolicy.

public enum Strategy
{
    OneForOne,
    AllForOne,
}
public enum FailureAction
{
    Restart,
    Stop,
    Escalate,
    Resume,
}

Besides the strategy, each child actor has a RestartPolicy that defines whether it should be restarted:

public enum RestartPolicy
{
    Permanent,  // Always restart regardless of termination reason
    Transient,  // Restart only if it terminates abnormally
    Temporary,  // Never restart
}
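Read as a decision rule, the three policies boil down to a single function. This is only a sketch: `RestartPolicyRules.ShouldRestart` and the `terminatedAbnormally` flag are hypothetical names, and the enum is redeclared so the snippet stands alone.

```csharp
using System;

// Redeclared here only so the sketch compiles on its own.
public enum RestartPolicy { Permanent, Transient, Temporary }

// Hypothetical helper that spells out the comments on the enum as code.
public static class RestartPolicyRules
{
    // terminatedAbnormally: true when the actor stopped because of an unhandled exception.
    public static bool ShouldRestart(RestartPolicy policy, bool terminatedAbnormally) =>
        policy switch
        {
            RestartPolicy.Permanent => true,
            RestartPolicy.Transient => terminatedAbnormally,
            RestartPolicy.Temporary => false,
            _ => throw new ArgumentOutOfRangeException(nameof(policy)),
        };
}
```

Only Transient depends on how the actor ended: it is restarted after a crash but left alone after a normal shutdown.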

Now let's create an interface called ISupervisor:

/// <summary>
/// Represents a supervisor actor that manages child actors.
/// </summary>
public interface ISupervisor : IActor
{
    /// <summary>
    /// Gets the collection of child actor references managed by this supervisor.
    /// </summary>
    IEnumerable<IActorReference> Children { get; }
}

We also need a way to describe how a child should be created. For that we have IChildSpecification:

public interface IChildSpecification
{
    Type ActorType { get; }
    IMailbox Mailbox { get; set; }
    RestartPolicy RestartPolicy { get; set; }
}

And the default implementation:

public record ChildSpecification : IChildSpecification
{
    public ChildSpecification(Type actorType)
    {
        ActorType = actorType;
    }

    public Type ActorType { get; }
    public IMailbox Mailbox { get; set; } = new ChannelMailbox();
    public RestartPolicy RestartPolicy { get; set; } = RestartPolicy.Permanent;
}

Each supervised actor is tracked via a Child metadata class that holds the actor instance, its process, mailbox, reference, restart policy, and restart counters:

public class Child(
    IActor actor,
    IMailbox mailbox,
    ActorProcess process,
    LocalActorReference reference,
    RestartPolicy restartPolicy,
    Type actorType)
{
    public IActor Actor { get; set; } = actor;
    public ActorProcess Process { get; set; } = process;
    public IMailbox Mailbox { get; } = mailbox;
    public LocalActorReference Reference { get; } = reference;
    public RestartPolicy RestartPolicy { get; } = restartPolicy;
    public Type ActorType { get; } = actorType;
    public int RestartCount { get; set; } = 0;
    public DateTimeOffset LastRestartTime { get; set; } = DateTimeOffset.MinValue;
    public Dictionary<string, object> Metadata { get; } = [];
    public bool IsSupervisor => Actor is ISupervisor;
}

Supervisor

The base Supervisor is an abstract class that implements ISupervisor. It is preemptive — children are defined during initialization and cannot be added afterwards.

public abstract partial class Supervisor(IActorFactory actorFactory, ILogger logger)
    : Actor,
        ISupervisor,
        IHandleActorMessage<AddActor>,
        IHandleActorMessage<ActorFailed>,
        IHandleActorMessage<ActorTerminated>,
        IAsyncDisposable
{

It exposes virtual properties so subclasses can customize the behaviour:

    protected virtual Strategy Strategy => Strategy.OneForOne;
    protected virtual int MaxRestarts => 3;
    protected virtual TimeSpan RestartWindow => TimeSpan.FromSeconds(5);

During initialization, it calls the abstract OnInitializeAsync where subclasses define their children:

    public sealed override async ValueTask InitializeAsync(
        CancellationToken cancellationToken = default)
    {
        await OnInitializeAsync(cancellationToken);
        _initialized = true;
    }

    protected abstract ValueTask OnInitializeAsync(
        CancellationToken cancellationToken = default);

When a child fails, the supervisor decides what to do via GetFailureAction. By default, it restarts the child until the child's restart count reaches MaxRestarts, at which point it escalates:

    protected virtual FailureAction GetFailureAction(Child child, Exception exception)
    {
        if (child.RestartCount >= MaxRestarts)
        {
            return FailureAction.Escalate;
        }

        return FailureAction.Restart;
    }
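The snippet above compares only RestartCount against MaxRestarts; how RestartWindow enters the decision is not shown here. One possible way to combine the two, purely as an assumption on my side, is to forget old restarts once a child has gone a full window without restarting. `RestartTracker` is a hypothetical helper, and the enum is redeclared so the sketch stands alone.

```csharp
using System;

// Redeclared here only so the sketch compiles on its own.
public enum FailureAction { Restart, Stop, Escalate, Resume }

// Hypothetical helper, not part of Trupe: limits restarts that happen close together.
public sealed class RestartTracker(int maxRestarts, TimeSpan window)
{
    private int _count;
    private DateTimeOffset _lastRestart = DateTimeOffset.MinValue;

    public FailureAction Decide(DateTimeOffset now)
    {
        // Assumption: after a quiet period longer than the window, the counter starts over.
        if (now - _lastRestart > window)
        {
            _count = 0;
        }

        // Too many restarts in quick succession: hand the problem to the parent.
        if (_count >= maxRestarts)
        {
            return FailureAction.Escalate;
        }

        _count++;
        _lastRestart = now;
        return FailureAction.Restart;
    }
}
```

With a limit of 3 and a 5 second window, three failures one second apart are restarted, a fourth one right after is escalated, and a failure that arrives after more than 5 quiet seconds is restarted again.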

When the action is Restart, the strategy determines who gets restarted:

    protected virtual async Task ApplyRestartAsync(Child child)
    {
        child.RestartCount++;
        child.LastRestartTime = DateTimeOffset.UtcNow;

        if (Strategy == Strategy.OneForOne)
        {
            await ResetActorAsync(child);
        }
        else if (Strategy == Strategy.AllForOne)
        {
            await Task.WhenAll(Children.Select(ResetActorAsync));
        }
    }

When the supervisor itself is restarted, all children are stopped and disposed:

    public override async ValueTask BeforeRestartAsync(
        CancellationToken cancellationToken = default)
    {
        foreach (var metadata in Children)
        {
            await StopActorAsync(metadata);
            await DisposeObjectAsync(metadata.Actor);

            metadata.Actor = null!;
            metadata.Process = null!;
            metadata.Metadata.Clear();
        }

        Children = [];
    }

DynamicSupervisor

The DynamicSupervisor extends Supervisor to allow adding and removing children at runtime. Its strategy is sealed to OneForOne, because restarting every actor when one fails makes little sense when the actors are created independently.

public abstract class DynamicSupervisor(IActorFactory actorFactory, ILogger logger)
    : Supervisor(actorFactory, logger),
        IHandleActorMessage<RemoveChild>
{
    protected sealed override Strategy Strategy => Strategy.OneForOne;

Children are added dynamically by sending an AddActor command to the supervisor's own mailbox:

    protected override IActorReference AddChild(IChildSpecification specification)
    {
        var actorRef = new LocalActorReference(specification.Mailbox);
        Context.Self.Tell(new AddActor(specification, actorRef));
        return actorRef;
    }
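A likely benefit of going through the mailbox is that the list of children is only ever changed from the supervisor's own message loop. The sketch below illustrates that single-writer idea with System.Threading.Channels. `SingleWriterList` and `AddItem` are hypothetical stand-ins for the supervisor and the AddActor message, not Trupe types.

```csharp
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical command, standing in for the AddActor message.
public sealed record AddItem(string Name);

public sealed class SingleWriterList
{
    private readonly Channel<AddItem> _mailbox = Channel.CreateUnbounded<AddItem>();
    private readonly List<string> _items = [];

    // Callable from any thread: it only enqueues and never touches _items.
    public bool Add(string name) => _mailbox.Writer.TryWrite(new AddItem(name));

    // Signals that no more commands will arrive, which lets RunAsync finish.
    public void Complete() => _mailbox.Writer.Complete();

    // The only place that mutates _items, so no lock is needed.
    public async Task<IReadOnlyList<string>> RunAsync()
    {
        await foreach (var command in _mailbox.Reader.ReadAllAsync())
        {
            _items.Add(command.Name);
        }

        return _items;
    }
}
```

Callers get an answer immediately (in AddChild's case, the reference), while the actual change is applied later, in order, by the loop that owns the state.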

And removed via RemoveChild:

    public async ValueTask HandleAsync(
        RemoveChild message,
        CancellationToken cancellationToken = default)
    {
        var child = Children.FirstOrDefault(x => x.Actor == message.Actor);

        if (child != null)
        {
            Children = Children.Remove(child);

            await StopActorAsync(child);
            await DisposeObjectAsync(child.Actor);
            await child.Process.DisposeAsync();

            child.Actor = null!;
            child.Process = null!;
        }
    }

The DynamicSupervisor also respects RestartPolicy during failures and terminations. Temporary actors are removed after failure, and non-permanent actors are removed after termination.

RootSupervisor

The RootSupervisor is a concrete implementation of Supervisor that serves as the top-level supervisor in the actor system. It implements the IRootSupervisor marker interface and is configured via RootSupervisorOptions, which holds a list of IChildSpecification children.

public class RootSupervisor(
    IOptions<RootSupervisorOptions> options,
    IActorFactory actorFactory,
    ILogger<RootSupervisor> logger
) : Supervisor(actorFactory, logger), IRootSupervisor

During initialization, it adds all children defined in the options:

    protected override ValueTask OnInitializeAsync(CancellationToken cancellationToken = default)
    {
        foreach (var child in options.Value.Children)
        {
            AddChild(child);
        }
        return ValueTask.CompletedTask;
    }

Unlike the base Supervisor, the RootSupervisor always restarts failed children — it never escalates, since there is no parent supervisor above it:

    protected override FailureAction GetFailureAction(Child child, Exception exception)
    {
        return FailureAction.Restart;
    }

PartitionSupervisor

The PartitionSupervisor<TActor> creates a fixed pool of identical worker actors and routes messages to them using hash-based partitioning. The number of workers defaults to Environment.ProcessorCount.

public abstract partial class PartitionSupervisor<TActor>(
    IActorFactory actorFactory, ILogger logger, int workers)
    : Actor,
        ISupervisor,
        IHandleActorMessage<ActorFailed>,
        IHandleActorMessage<ActorTerminated>,
        IAsyncDisposable
{

It exposes similar configuration properties:

    protected virtual int Workers { get; } = workers;
    protected virtual Strategy Strategy => Strategy.OneForOne;
    protected virtual RestartPolicy DefaultRestartPolicy => RestartPolicy.Permanent;
    protected virtual int MaxRestarts => 3;
    protected virtual TimeSpan RestartWindow => TimeSpan.FromSeconds(5);

During initialization, it creates all workers upfront:

    public override async ValueTask InitializeAsync(
        CancellationToken cancellationToken = default)
    {
        for (var i = 0; i < Workers; i++)
        {
            CreateActor(
                new ChildSpecification(typeof(TActor))
                {
                    RestartPolicy = DefaultRestartPolicy,
                    Mailbox = CreateMailbox(),
                });
        }

        await OnInitializeAsync(cancellationToken);
        _initialized = true;
    }

Messages are routed to a worker using a partition key. The key is hashed and mapped to a worker index:

    protected virtual IActorReference GetActorReference<TKey>(TKey key)
        where TKey : notnull
    {
        // Math.Abs(int.MinValue) throws OverflowException, so clear the sign bit instead.
        var hash = GetHashcode(key) & int.MaxValue;
        return Children[hash % Children.Count].Reference;
    }

This ensures that messages with the same key always go to the same worker, which is useful for scenarios like session affinity or consistent data partitioning.
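One caveat worth keeping in mind: on modern .NET, string.GetHashCode() is randomized per process, so if the hash is based on it, "same key, same worker" holds only within a single run. If the mapping needs to survive restarts, a process-independent hash is one option. The sketch below uses FNV-1a as a swapped-in technique; `Partitioner`, `StableHash`, and `IndexFor` are hypothetical names, not what Trupe's GetHashcode does.

```csharp
using System;
using System.Text;

public static class Partitioner
{
    // FNV-1a (32-bit): a simple hash whose output does not depend on the process.
    public static uint StableHash(string key)
    {
        unchecked
        {
            var hash = 2166136261u; // FNV offset basis
            foreach (var b in Encoding.UTF8.GetBytes(key))
            {
                hash ^= b;
                hash *= 16777619u; // FNV prime; overflow wraps on purpose
            }

            return hash;
        }
    }

    // Maps a key to a worker index in the range [0, workers).
    public static int IndexFor(string key, int workers)
    {
        if (workers <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(workers));
        }

        // An unsigned hash avoids the negative-modulo problem entirely.
        return (int)(StableHash(key) % (uint)workers);
    }
}
```

Calling `Partitioner.IndexFor("session-42", 8)` returns the same index every time, in every process, as long as the worker count stays at 8.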

Conclusion

With the Supervisor, DynamicSupervisor, PartitionSupervisor, and the RootSupervisor as the top-level entry point, we now have a solid foundation for fault tolerance in our actor system. Each supervisor variant addresses a different use case — from static child sets, to dynamic actor management, to hash-based worker partitioning.

If you're interested in the deep details of the implementation, feel free to explore the full source code on GitHub: lillo42/trupe.
