
Ganesh Parella


I Broke My Own Workflow Engine at Scale — Here's How I Fixed It

In my last post, I broke down how I built FlowForge, a fault-tolerant DAG workflow engine using ASP.NET Core, React, and MySQL. I explained how I solved complex branching and dependency execution using Kahn's Algorithm and a database-backed state machine.

It works perfectly for hundreds of concurrent users. But as engineers, we always have to ask the dangerous question: what happens when we 1,000x the load? Let's dive into the absolute limits of the current architecture, watch it break, and re-architect it to handle massive scale: 1,000,000 flow executions per second.

How the V1 Engine Works (And Why It Will Fail)
To understand how to scale the engine, you need to understand what it's currently doing.

Storage & Parsing:
When a user builds a flow in the React frontend, we save the entire raw JSON (including UI viewports and node coordinates) into the definitionJson column of our Flow table. When execution starts, we parse this JSON into a backend-friendly ParsedFlow (a strict list of Nodes and Edges) and verify there are no cyclical loops using Topological Sorting.

The Execution Loop:
When a flow is triggered, we create a FlowInstance in the database to track the overall run. We then generate NodeInstances for every step, marking the initial trigger node as Ready and everything else as Pending.

The Polling Mechanism:
A background service aggressively polls the MySQL database every 10 milliseconds, looking for nodes in the Ready state. When it finds one, it executes it, evaluates the downstream dependencies, and marks the next children as Ready.
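In sketch form, the V1 scheduler looks roughly like this (`execute` is an illustrative stand-in for the real node executor, and a dict stands in for the NodeInstance table):

```python
import time

def polling_loop(store, execute, poll_interval=0.01, max_iterations=None):
    """V1-style scheduler: every 10ms, scan for Ready nodes and run them.
    `store` maps node id -> status; `execute` runs one node and returns its children."""
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        ready = [n for n, status in store.items() if status == "Ready"]
        for node in ready:
            store[node] = "Running"
            for child in execute(node):
                store[child] = "Ready"   # unblock downstream dependencies
            store[node] = "Completed"
        time.sleep(poll_interval)        # this SELECT-every-10ms scan is the scaling killer
        iterations += 1
```

Note that the loop does work proportional to the poll rate even when nothing is happening; that wasted scan is exactly what the re-architecture below removes.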

The Breaking Points at 1 Million RPS
If we force 1 million workflows per second through this V1 architecture, two things will immediately catch fire:

Breaking Point 1: ThreadPool Starvation (CPU Bottleneck)
ASP.NET Core is incredibly efficient with async/await, releasing threads back to the pool during I/O operations. However, parsing a massive workflow JSON 1 million times a second is a heavy, synchronous, CPU-bound operation. We will quickly exhaust the available worker threads, leading to severe latency spikes and HTTP 503 errors as the API Gateway drops incoming requests.

Breaking Point 2: The Database Meltdown
Polling a database every 10ms means 100 scans per second per scheduler; multiply that across the node states of 1,000,000 concurrent flows and the read volume becomes astronomical. The MySQL CPU will hit 100%, disk I/O will bottleneck, and the database will crash before the web servers even break a sweat.

Re-Architecting for 1M Flows per Second

To survive this scale, we have to fundamentally shift from a monolithic, polling-based architecture to a distributed, event-driven one.

1. Decoupling with Message Queues
Instinctively, many developers think, "Just throw a Message Queue in front of it." But what actually goes into the queue?

Instead of the API server trying to execute the flow, the API server's only job is to receive the HTTP request, validate it, and drop a StartFlowMessage (containing the FlowId and payload) into a high-throughput broker like Kafka or RabbitMQ. The API responds with a 202 Accepted immediately, freeing up the web thread in milliseconds.
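The shape of that thin handler, sketched with an in-memory queue standing in for the Kafka producer (the handler name and message fields are illustrative, mirroring the StartFlowMessage described above):

```python
import json
import queue
import uuid

broker = queue.Queue()  # stand-in for a Kafka/RabbitMQ producer

def start_flow_handler(flow_id, payload, known_flows):
    """Validate, enqueue a StartFlowMessage, and return 202 immediately.
    No parsing or execution happens on the web thread."""
    if flow_id not in known_flows:
        return 404, None
    message = {
        "type": "StartFlowMessage",
        "instanceId": str(uuid.uuid4()),  # lets the client poll for status later
        "flowId": flow_id,
        "payload": payload,
    }
    broker.put(json.dumps(message))       # hand off to the broker and return
    return 202, message["instanceId"]
```

The web thread's total work is one validation and one enqueue; everything expensive moves behind the broker.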

2. Scaling Horizontally with Stateless Workers
Now that execution is decoupled, we deploy dedicated, stateless Worker Services reading directly from the Kafka partitions. If the queue gets backed up, we simply spin up 50 more worker containers in Kubernetes to chew through the backlog.

3. Eradicating Database Polling (Event-Driven Execution)
We completely remove the 10ms SELECT loop. When a worker finishes executing "Node A", it doesn't wait for the database. Instead, it calculates what nodes are unblocked, updates their state in MySQL, and immediately pushes a new ExecuteNodeMessage into the queue for "Node B". The engine is now entirely event-driven.
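The worker's consume loop then becomes the whole scheduler. A sketch with in-memory structures standing in for the Kafka consumer and the MySQL state table (the ExecuteNodeMessage shape and helper names are assumptions):

```python
import queue

def parents(dag, node):
    """All nodes with an edge into `node`."""
    return [p for p, kids in dag.items() if node in kids]

def run_worker(messages, dag, statuses, handlers):
    """Event-driven execution: no polling. Each finished node immediately
    enqueues ExecuteNodeMessages for the children it unblocked.
    `dag` maps node -> children; `statuses` is the MySQL-backed state table."""
    while True:
        try:
            node = messages.get_nowait()
        except queue.Empty:
            break                          # a real worker blocks on the broker instead
        statuses[node] = "Running"
        handlers[node]()                   # the node's actual work
        statuses[node] = "Completed"
        for child in dag[node]:
            # Unblocked children go straight back onto the queue -- no 10ms SELECT loop.
            if all(statuses[p] == "Completed" for p in parents(dag, child)):
                statuses[child] = "Ready"
                messages.put(child)
```

Idle cost drops to zero: if no messages arrive, no database reads happen at all.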

4. Handling Concurrency and Idempotency
Here is the massive catch with horizontal scaling: What if two different workers pick up the same flow at the exact same time?

To maintain idempotency, we implement Optimistic Concurrency Control in our database. We add a Version or RowVersion column to our NodeInstance table. When a worker tries to update a node from Ready to Running, it executes:
```sql
UPDATE NodeInstance
SET Status = 'Running', Version = Version + 1
WHERE Id = X AND Version = Y;
```
If another worker already took the job, the row version will have changed, the update will return 0 affected rows, and the second worker knows to safely drop the duplicate task.
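The claim logic on the worker side, sketched against an in-memory row (the SQL above is the real guard; here a compare-and-swap on a dict stands in for it):

```python
import threading

_lock = threading.Lock()  # simulates the row-level atomicity MySQL gives the UPDATE

def try_claim(row, expected_version):
    """Mimics: UPDATE ... SET Status = 'Running', Version = Version + 1
               WHERE Id = X AND Version = <expected>; returns the affected-row count."""
    with _lock:
        if row["Version"] != expected_version:
            return 0                     # someone else won the race: drop the duplicate
        row["Status"] = "Running"
        row["Version"] += 1
        return 1                         # we own the node and may execute it
```

Workers never coordinate with each other directly; the version column makes the database the single arbiter of who owns a node.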

5. Reducing Latency with Distributed Caching
Parsing the raw definitionJson from MySQL on every execution is too expensive. We introduce Redis to cache the highly requested, pre-parsed DAG structures. When a worker picks up a flow, it grabs the pre-compiled execution plan from Redis in sub-milliseconds, bypassing the JSON deserialization tax entirely.

Conclusion
Scaling a system is rarely about writing "faster code"; it is about removing bottlenecks. By shifting from database polling to an event-driven queue, offloading work to stateless consumers, utilizing distributed caching, and enforcing optimistic concurrency, FlowForge evolves from a reliable prototype into an enterprise-grade orchestration engine.
(Missed the first part? Check out the original build: How I Built a Fault-Tolerant DAG Workflow Engine in ASP.NET Core)
