I recently built FlowForge, a visual workflow automation platform where users can connect multiple external applications and orchestrate complex flows simply by dragging and dropping nodes onto a canvas.
At first glance, building a Zapier clone seemed straightforward: a React frontend to draw the boxes, and a backend to execute them. But as I got deeper into the architecture, handling state, concurrency, and fault tolerance turned this into a serious distributed systems challenge.
Without any filler, let's dive into the architecture of how I actually built a workflow automation platform from scratch.
System Requirements
Functional Requirements:
- Users can authenticate with multiple third-party applications (Google, Slack, etc.) via OAuth.
- Users can build and configure workflows using a drag-and-drop canvas.
Non-Functional Requirements:
- Pluggability: Adding a new integration must be as simple as adding a new file, without modifying the core engine.
- Fault Tolerance (Persistence): Even if the server crashes unexpectedly mid-execution, the workflow must be able to resume exactly where it left off.
- Concurrent Branching: The engine must support complex dependencies (e.g., Node B and Node C can run simultaneously, but both must wait for Node A to finish).
Core Entities
- Flow: The blueprint of the workflow.
- Connection: External OAuth credentials and integration metadata.
- FlowInstance: A single, unique execution run of a Flow.
- NodeInstance: The execution state of an individual step within that flow.
Tech Stack & High-Level Architecture
- Frontend: React.js
- Backend: ASP.NET Core Web API
- Database: MySQL
Core Services:
Client: The visual node-based editor for the user interface.
API Gateway: Handles routing and load distribution to the backend services.
Flow Service: Manages CRUD operations for the workflows.
Connection Service: Securely handles external application connections and token management.
Flow Engine: The core "brain" of the system that parses and executes the workflows.
Deep Dive 1: The Evolution of the Core Engine
Building the execution engine was the hardest architectural problem in the project. I had to rewrite this component three separate times to get it right.
V1: Linear Execution (The Naive Approach)
Initially, I just converted the workflow into a flat List and executed the nodes in sequence. This failed immediately: real workflows don't always run in a straight line, and sometimes a flow is triggered by an event in the middle of a chain.
V2: Topological Sorting (Kahn's Algorithm)
To fix the branching issue, I modeled the workflow as a Directed Acyclic Graph (DAG). By using Kahn's Algorithm for topological sorting, I could preserve the strict execution order, ensure dependencies were met, and detect invalid cycles.
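To make that concrete, here is a minimal sketch of Kahn's algorithm over a simplified adjacency-list graph. This is illustrative only, not the production DagConvertor code; node ids are plain strings here.

```csharp
// Minimal sketch of Kahn's algorithm. `edges` maps a node id to its children.
public static List<string> TopologicalSort(
    Dictionary<string, List<string>> edges, IEnumerable<string> nodes)
{
    // Count incoming edges for every node.
    var inDegree = nodes.ToDictionary(n => n, _ => 0);
    foreach (var children in edges.Values)
        foreach (var child in children)
            inDegree[child]++;

    // Start with nodes that have no unmet dependencies.
    var queue = new Queue<string>(
        inDegree.Where(kv => kv.Value == 0).Select(kv => kv.Key));
    var order = new List<string>();

    while (queue.Count > 0)
    {
        var current = queue.Dequeue();
        order.Add(current);
        if (!edges.TryGetValue(current, out var children)) continue;

        // A child becomes eligible once its last parent is processed.
        foreach (var child in children)
            if (--inDegree[child] == 0)
                queue.Enqueue(child);
    }

    // If some node was never reached, the graph contains a cycle.
    if (order.Count != inDegree.Count)
        throw new InvalidOperationException("Cycle detected in workflow graph.");

    return order;
}
```

The cycle check at the end is what lets the engine reject invalid workflows before execution ever starts.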
However, I hit a massive limitation: State Loss. Because Kahn's algorithm was running entirely in server memory, if the ASP.NET server restarted, the entire execution vanished like it never existed.
V3: The Persistent DAG Scheduler (The Final Form)
To achieve true fault tolerance, I moved the state machine out of memory and into the MySQL database. The engine now operates as a persistent scheduler. It continuously polls the database to find nodes where all parent dependencies are marked as Completed, marks them as Ready, and executes them.
Here is the actual C# implementation of the persistent engine:
public class FlowEngine
{
    private readonly IServiceProvider _serviceProvider;
    private readonly ILogger<FlowEngine> _logger;
    private readonly IEnumerable<INode> _availableNodes;
    private readonly DagConvertor _dagConvertor;
    private readonly NodeExecutionRepository _nodeExecutionRepository;
    private readonly IFlowInstanceRepository _flowInstanceRepository;

    public FlowEngine(
        IServiceProvider serviceProvider,
        ILogger<FlowEngine> logger,
        IEnumerable<INode> availableNodes,
        DagConvertor dagConvertor,
        NodeExecutionRepository nodeExecutionRepository,
        IFlowInstanceRepository flowInstanceRepository)
    {
        _serviceProvider = serviceProvider;
        _logger = logger;
        _availableNodes = availableNodes;
        _dagConvertor = dagConvertor;
        _nodeExecutionRepository = nodeExecutionRepository;
        _flowInstanceRepository = flowInstanceRepository;
    }

    // Allows the service layer to reuse the DAG-building logic
    public Dag BuildDag(ParsedFlow parsedFlow)
    {
        return _dagConvertor.ParsedFlowToDag(parsedFlow);
    }

    // ============================================================
    // 🚀 Persistent DAG Scheduler
    // ============================================================
    public async Task RunPersistentAsync(
        Guid flowInstanceId,
        ParsedFlow parsedFlow,
        string clerkUserId,
        Dictionary<string, object>? initialPayload = null)
    {
        var payload = initialPayload ?? new Dictionary<string, object>();
        var dag = _dagConvertor.ParsedFlowToDag(parsedFlow);

        while (true)
        {
            // Fail fast
            if (await _nodeExecutionRepository.AnyFailedAsync(flowInstanceId))
            {
                await _flowInstanceRepository.MarkFailedAsync(flowInstanceId);
                return;
            }

            // Complete if done
            if (await _nodeExecutionRepository.AllCompletedAsync(flowInstanceId))
            {
                await _flowInstanceRepository.MarkCompletedAsync(flowInstanceId);
                return;
            }

            var nodeExecution =
                await _nodeExecutionRepository.GetNextReadyNodeAsync(flowInstanceId);
            if (nodeExecution == null)
            {
                await Task.Delay(50); // nothing Ready yet; poll again shortly
                continue;
            }

            await _nodeExecutionRepository.MarkRunningAsync(nodeExecution.Id);

            var parsedNode =
                parsedFlow.Nodes.First(n => n.Id == nodeExecution.NodeId);

            try
            {
                var output = await ExecuteNodeAsync(
                    parsedNode,
                    parsedFlow.Name,
                    clerkUserId,
                    payload);

                await _nodeExecutionRepository.MarkCompletedAsync(nodeExecution.Id);

                if (output != null)
                {
                    foreach (var kvp in output)
                        payload[kvp.Key] = kvp.Value;
                }

                await UnlockChildrenAsync(flowInstanceId, dag, parsedNode.Id);
            }
            catch (Exception ex)
            {
                await _nodeExecutionRepository.MarkFailedAsync(nodeExecution.Id, ex.Message);
            }
        }
    }

    private async Task<Dictionary<string, object>?> ExecuteNodeAsync(
        ParsedNode node,
        string flowName,
        string clerkUserId,
        Dictionary<string, object> payload)
    {
        var executor = _availableNodes.FirstOrDefault(n => n.Type == node.Type);
        if (executor == null)
            throw new InvalidOperationException(
                $"Executor for node type '{node.Type}' not found.");

        var context = new FlowExecutionContext(
            clerkUserId,
            payload,
            ConvertToJsonElement(node.Data));

        return await executor.ExecuteAsync(context, _serviceProvider);
    }

    private async Task UnlockChildrenAsync(
        Guid flowInstanceId,
        Dag dag,
        string completedNodeId)
    {
        if (!dag.AdjList.TryGetValue(completedNodeId, out var children))
            return;

        foreach (var child in children)
        {
            if (!dag.ReverseAdjList.TryGetValue(child.Id, out var parents))
                continue;

            var allParentsCompleted =
                await _nodeExecutionRepository
                    .AreAllParentsCompletedAsync(flowInstanceId, parents);

            if (allParentsCompleted)
            {
                await _nodeExecutionRepository.MarkReadyAsync(flowInstanceId, child.Id);
            }
        }
    }

    private JsonElement ConvertToJsonElement(object? obj)
    {
        if (obj == null) return default;
        var json = JsonSerializer.Serialize(obj);
        return JsonSerializer.Deserialize<JsonElement>(json);
    }
}
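One detail worth calling out is how GetNextReadyNodeAsync can stay safe when several workers poll the same table. The sketch below (not the actual repository code; it assumes Dapper, the MySqlConnector driver, MySQL 8+, and an illustrative node_instances schema) uses FOR UPDATE SKIP LOCKED so two workers never claim the same node:

```csharp
// Sketch of a multi-worker-safe GetNextReadyNodeAsync.
// Table and column names are illustrative; the real schema may differ.
public async Task<NodeInstance?> GetNextReadyNodeAsync(Guid flowInstanceId)
{
    await using var connection = new MySqlConnection(_connectionString);
    await connection.OpenAsync();
    await using var transaction = await connection.BeginTransactionAsync();

    // SKIP LOCKED makes a second worker skip rows this worker has locked,
    // so concurrent pollers never grab the same Ready node.
    var node = await connection.QuerySingleOrDefaultAsync<NodeInstance>(
        @"SELECT Id, NodeId, Status
          FROM node_instances
          WHERE FlowInstanceId = @flowInstanceId AND Status = 'Ready'
          ORDER BY CreatedAt
          LIMIT 1
          FOR UPDATE SKIP LOCKED;",
        new { flowInstanceId }, transaction);

    if (node != null)
    {
        // Claim the node inside the same transaction so the row lock is
        // only released once the status flip is durable.
        await connection.ExecuteAsync(
            "UPDATE node_instances SET Status = 'Running' WHERE Id = @Id;",
            new { node.Id }, transaction);
    }

    await transaction.CommitAsync();
    return node;
}
```

This variant claims the node atomically, which would make a separate MarkRunningAsync call unnecessary; a single-worker engine can get away with the simpler select-then-update flow shown above.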
Deep Dive 2: Building a Pluggable Architecture
A workflow engine is useless if it is hard to add new integrations. To make the system truly pluggable, I utilized the Factory Pattern and .NET Reflection.
Instead of hardcoding a massive switch statement or storing the node executors in a static list, I made every node type (Action, Trigger, Conditional) inherit from an INode interface. At startup, the application scans the assembly to find all implementations and registers them dynamically in the Dependency Injection container.
public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddNodeServices(this IServiceCollection services)
    {
        var nodeType = typeof(INode);

        // Scan every loaded assembly for concrete INode implementations
        var implementations = AppDomain.CurrentDomain.GetAssemblies()
            .SelectMany(a => a.GetTypes())
            .Where(t => nodeType.IsAssignableFrom(t)
                        && t is { IsClass: true, IsAbstract: false });

        foreach (var implementation in implementations)
        {
            services.AddScoped(typeof(INode), implementation);
        }

        return services;
    }
}
Now, adding a new Slack integration is as simple as creating a new class that implements INode. The engine handles the rest automatically.
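As an illustration, a new integration might look roughly like this. This is a hypothetical sketch: the real INode contract, context shape, and Slack call in FlowForge may differ.

```csharp
// Hypothetical sketch of a new integration; names are illustrative only.
public class SlackSendMessageNode : INode
{
    // The engine matches this against ParsedNode.Type to find the executor.
    public string Type => "slack.sendMessage";

    public async Task<Dictionary<string, object>?> ExecuteAsync(
        FlowExecutionContext context, IServiceProvider services)
    {
        // Resolve whatever the node needs from DI (e.g., an HTTP client).
        var httpFactory = services.GetRequiredService<IHttpClientFactory>();
        var client = httpFactory.CreateClient("slack");

        // Node configuration (channel, message template) would come from
        // context; the actual call to Slack's chat.postMessage goes here.

        // Whatever is returned gets merged into the shared flow payload.
        return new Dictionary<string, object> { ["slack.ok"] = true };
    }
}
```

Because the assembly scan above registers every concrete INode automatically, dropping this file into the project is the entire integration: no engine changes, no registration code.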
Deep Dive 3: Ensuring Concurrent Execution
If you look at the engine logic, you will notice that each node's status lives in the database rather than in memory.
When a parent node completes and unlocks two distinct children, both of those children independently transition to the Ready state. Because the engine dispatches ready nodes asynchronously, the two branches can execute concurrently without blocking each other.
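A simplified sketch of that dispatch (not the exact engine code; GetReadyNodesAsync is a hypothetical repository method that returns every currently-Ready node) looks like this:

```csharp
// Sketch: drain all Ready nodes in one pass and run them as independent
// tasks, so sibling branches unlocked by the same parent run in parallel.
var readyNodes = await _nodeExecutionRepository.GetReadyNodesAsync(flowInstanceId);

var running = readyNodes.Select(async node =>
{
    await _nodeExecutionRepository.MarkRunningAsync(node.Id);
    var parsedNode = parsedFlow.Nodes.First(n => n.Id == node.NodeId);

    var output = await ExecuteNodeAsync(
        parsedNode, parsedFlow.Name, clerkUserId, payload);

    await _nodeExecutionRepository.MarkCompletedAsync(node.Id);
    await UnlockChildrenAsync(flowInstanceId, dag, parsedNode.Id);
});

// Neither branch blocks the other; the pass ends when both finish.
await Task.WhenAll(running);
```

Note that with concurrent branches the shared payload dictionary would need synchronization (for example a ConcurrentDictionary), since plain Dictionary writes are not thread-safe.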
Trade-Offs & Scaling Bottlenecks
Building this highlighted some clear scaling limits in my current architecture:
ThreadPool Exhaustion: Right now, I am executing nodes asynchronously in memory. If the platform scales to 100,000 concurrent users running workflows, we will hit the ASP.NET ThreadPool limits. To solve this, I would need to decouple the execution by introducing a Message Queue (like RabbitMQ) and dedicated background worker services.
Database Write Limits: Flow execution is incredibly write-heavy (updating status to Running, Completed, Failed every few milliseconds). MySQL is great, but at massive scale, it will hit write-throughput bottlenecks. Shifting the execution state storage to a Key-Value database (like DynamoDB or Redis) would be necessary for global scalability.
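As a rough sketch of the queue-based decoupling (assuming the RabbitMQ.Client package; queue and message shapes are illustrative), the engine would publish node executions instead of running them inline, and a pool of worker services would consume them:

```csharp
// Sketch: publish a node execution to RabbitMQ instead of executing inline.
// A separate worker service consumes "node-executions" and runs the node.
var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

channel.QueueDeclare(
    queue: "node-executions",
    durable: true,       // the queue survives broker restarts
    exclusive: false,
    autoDelete: false);

var body = JsonSerializer.SerializeToUtf8Bytes(new
{
    FlowInstanceId = flowInstanceId,
    NodeExecutionId = nodeExecution.Id
});

var props = channel.CreateBasicProperties();
props.Persistent = true; // the message itself is written to disk

channel.BasicPublish(
    exchange: "",
    routingKey: "node-executions",
    basicProperties: props,
    body: body);
```

Because all execution state already lives in MySQL, this change is mostly mechanical: the worker loads the NodeInstance by id, executes it, and runs the same UnlockChildrenAsync logic.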
Conclusion
Building FlowForge forced me to move beyond basic CRUD applications and tackle real distributed system challenges. By leveraging topological sorting for execution order, a database-backed state machine for fault tolerance, and .NET reflection for pluggability, the engine is robust and highly extensible.
You can explore the complete architecture and source code here: FlowForge on GitHub.
What are your thoughts on using database polling for the DAG scheduler versus an event-driven message queue? Let me know in the comments!
