Part three of the vexdo series — after building a local AI dev pipeline and moving it to the cloud
vexdo works. I use it. It handles the boring parts of shipping code — the implement-review-fix loop that used to eat my afternoons.
At some point I started thinking: could this be something more than a personal tool? Not just a CLI I run on my machine, but an actual product. Something with a proper foundation, not held together with state files and hardcoded pipeline logic.
And that's where things got complicated.
The problem with "just use a workflow engine"
The obvious answer when you want to orchestrate multi-step processes is: use a workflow engine. Airflow, Temporal, BullMQ, Prefect — there are plenty of them, and some are very good at what they do.
The problem is what they're good at.
These engines are built around a core assumption: you know your steps upfront. You define a DAG — nodes, edges, dependencies — and the engine executes it. The graph is fixed. That's the contract.
For traditional workflows, this is fine. ETL pipelines, CI/CD jobs, batch processing — you know what needs to happen before it starts happening.
AI agents break this assumption.
Here's a concrete example from vexdo. When an agent starts working on a task, it first analyzes the codebase — what files are involved, which modules are sensitive, how deep the change goes. But the result of that analysis determines what comes next.
Simple task touching one service? Skip the design council, go straight to implementation.
Task that touches the payments module? Spawn a dedicated security review. If it also changes the API schema, spawn a contract validation step. If the codebase has low test coverage, spawn a test generation pass first.
None of this is knowable when the workflow starts. The agent discovers it by doing the work.
If you try to handle this with a fixed DAG, you end up with one of two bad options:
- Pre-define every possible branch — the graph becomes a sprawling mess of conditional edges, and half the nodes never run. You're essentially writing a decision tree disguised as a workflow.
- Treat the whole thing as one big step — you lose parallelism, observability, retry granularity, and the ability to checkpoint. Your "workflow" is just a black box that either finishes or fails.
Neither option is good. What you actually want is a workflow that can extend itself at runtime — where completing a step can add new steps to the graph, based on what was discovered.
The core idea: a graph that grows
I've been calling this a living graph.
In a traditional workflow engine, the DAG is immutable after you start a run. In a living graph, nodes can spawn new nodes as part of their output. The graph is a starting point, not a constraint.
When a node completes, its result can include a list of new nodes to add to the graph — with their own dependencies, retry policies, and compensation logic. The scheduler picks them up and runs them exactly like any other node. From the engine's perspective, there's no difference between a node that was defined at the start and one that was spawned mid-run.
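To make that concrete, here is a minimal sketch of the idea in Go. All names here (NodeSpec, NodeResult, applySpawn) are invented for illustration and are not Grael's actual types:

```go
package main

import "fmt"

// Hypothetical shapes, not Grael's real types. A NodeSpec describes a
// node in the graph; a NodeResult is what an activity returns: its
// output plus any new nodes to graft onto the running graph.
type NodeSpec struct {
	ID        string
	DependsOn []string
}

type NodeResult struct {
	Output map[string]any
	Spawn  []NodeSpec
}

// applySpawn merges spawned nodes into the current graph. From the
// scheduler's point of view they are indistinguishable from nodes
// that were defined when the run started.
func applySpawn(graph map[string]NodeSpec, res NodeResult) {
	for _, n := range res.Spawn {
		graph[n.ID] = n
	}
}

func main() {
	graph := map[string]NodeSpec{"scout": {ID: "scout"}}
	res := NodeResult{
		Output: map[string]any{"complexity": "high"},
		Spawn: []NodeSpec{
			{ID: "council", DependsOn: []string{"scout"}},
			{ID: "implement", DependsOn: []string{"council"}},
		},
	}
	applySpawn(graph, res)
	fmt.Println(len(graph)) // the graph grew from 1 node to 3
}
```

The important property is in the last comment: after the merge, there is no record of which nodes were original and which were spawned. That symmetry is what keeps the scheduler simple.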
This is the key idea behind Grael — the workflow engine I built specifically for AI agent pipelines.
Workflow starts:
[scout] → ???
Scout runs, analyzes the codebase, returns:
output: { complexity: "high", touchedModules: ["payments", "api"] }
spawn: [
  { id: "council",    dependsOn: ["scout"] },
  { id: "implement",  dependsOn: ["council"] },
  { id: "sec-review", dependsOn: ["implement"] },   ← spawned because: payments
  { id: "reviewer",   dependsOn: ["implement"] },
  { id: "arbiter",    dependsOn: ["reviewer", "sec-review"] },
  { id: "pr",         dependsOn: ["arbiter"] }
]
Graph is now:
[scout] → [council] → [implement] → [reviewer] ───→ [arbiter] → [pr]
                                 └→ [sec-review] ─────┘
The spawn happened inside the scout's activity — the engine didn't know about any of this when the workflow started.
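The scout's decision logic can be sketched in Go. This is a simplified, hypothetical version (it omits the council step, and planSpawn is an invented name, not vexdo's actual agent code); the point is that sec-review only exists in the graph because the analysis found the payments module:

```go
package main

import "fmt"

// Node and Analysis are illustrative shapes, not Grael's API.
type Node struct {
	ID        string
	DependsOn []string
}

type Analysis struct {
	Complexity     string
	TouchedModules []string
}

// planSpawn turns what the scout discovered into the next stage of the
// graph. The security review is spawned only for sensitive modules.
func planSpawn(a Analysis) []Node {
	nodes := []Node{
		{ID: "implement", DependsOn: []string{"scout"}},
		{ID: "reviewer", DependsOn: []string{"implement"}},
	}
	arbiterDeps := []string{"reviewer"}
	for _, m := range a.TouchedModules {
		if m == "payments" {
			// Sensitive module discovered: add a dedicated security
			// review and make the arbiter wait for it.
			nodes = append(nodes, Node{ID: "sec-review", DependsOn: []string{"implement"}})
			arbiterDeps = append(arbiterDeps, "sec-review")
		}
	}
	nodes = append(nodes,
		Node{ID: "arbiter", DependsOn: arbiterDeps},
		Node{ID: "pr", DependsOn: []string{"arbiter"}},
	)
	return nodes
}

func main() {
	for _, n := range planSpawn(Analysis{Complexity: "high", TouchedModules: []string{"payments", "api"}}) {
		fmt.Println(n.ID, n.DependsOn)
	}
}
```

A task that never touches payments produces a graph with no sec-review node at all, rather than a permanently-skipped conditional edge.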
Another example: spec contradictions. The arbiter reviews the diff and notices that what the executor built doesn't match the original spec — not a code quality issue, but a genuine conflict in requirements. Maybe the spec said "use cursor-based pagination" but the executor implemented offset-based because an existing helper made it easier. Maybe two requirements in the spec are mutually exclusive and the executor quietly picked one.
In a fixed pipeline, this either escalates to a human or gets sent back to the executor with a "fix it" comment. But the right answer is often neither — you need to go back to whoever wrote the spec and ask for a decision.
With a living graph, the arbiter can spawn a spec-clarification node instead:
Arbiter runs, detects contradiction, returns:
output: { decision: "spec-contradiction" }
spawn: [
  {
    id: "spec-clarification",
    activityType: "spec-writer",      ← could be a human checkpoint or another agent
    dependsOn: ["arbiter"],
    input: {
      question: "Spec says cursor-based pagination, executor used offset. Which do you want?",
      context: { diff, specExcerpt }
    }
  },
  {
    id: "implement-revised",
    activityType: "executor",
    dependsOn: ["spec-clarification"]    ← continues once clarified
  },
  ...
]
Graph grows:
... → [arbiter] → [spec-clarification] → [implement-revised] → [reviewer-2] → [pr]
The spec-writer activity type could be anything — a human approval gate, a dedicated planning agent that re-evaluates the requirements, or a call to the original spec-generation step with additional context. The arbiter doesn't need to know. It just knows this is a spec problem, not a code problem, and spawns the right node type. That routing decision is something only an agent can make in context. You can't pre-define it in a YAML file before the run starts.
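That routing can be sketched as a small function. The names and decision strings here are hypothetical, standing in for whatever the arbiter agent actually returns:

```go
package main

import "fmt"

// Illustrative shapes only, not Grael's or vexdo's real types.
type Node struct {
	ID           string
	ActivityType string
	DependsOn    []string
}

// route maps the kind of problem the arbiter found to the node type
// that should handle it. The arbiter picks the node type; who or what
// fulfils "spec-writer" is decided by whichever worker registers for it.
func route(decision string) []Node {
	switch decision {
	case "spec-contradiction":
		// Spec problem: go back to whoever owns the spec, then re-implement.
		return []Node{
			{ID: "spec-clarification", ActivityType: "spec-writer", DependsOn: []string{"arbiter"}},
			{ID: "implement-revised", ActivityType: "executor", DependsOn: []string{"spec-clarification"}},
		}
	case "needs-fixes":
		// Code problem: send it straight back to the executor.
		return []Node{{ID: "implement-fix", ActivityType: "executor", DependsOn: []string{"arbiter"}}}
	default:
		// Clean pass: proceed to the PR step.
		return []Node{{ID: "pr", ActivityType: "pr-creator", DependsOn: []string{"arbiter"}}}
	}
}

func main() {
	for _, n := range route("spec-contradiction") {
		fmt.Println(n.ID, "handled by", n.ActivityType)
	}
}
```

The switch is trivial; what matters is that its input is produced by an agent reading a diff in context, not by a condition written before the run started.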
What Grael actually is
Grael is a Go service built around this idea. The code is on GitHub: github.com/ubcent/grael. A few things I cared about when building it:
Everything is an event. The entire state of a workflow run is an append-only event log. The current graph, node states, retry counts — all of it is derived by replaying events from the WAL. This means crashes are recoverable, history is auditable, and replay is deterministic. If Grael goes down mid-run, it picks up exactly where it left off.
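The pattern is plain event sourcing. A minimal sketch of the idea, with an invented event schema (Grael's real events carry much more):

```go
package main

import "fmt"

// Event is a single WAL entry. State is never stored directly; it is
// always derived by folding the log from the beginning.
type Event struct {
	Kind string // "node-added", "node-started", "node-completed"
	Node string
}

// replay folds the log into the current node states. Replaying the same
// log always yields the same state, which is exactly why crash recovery
// is just re-reading the WAL.
func replay(log []Event) map[string]string {
	state := map[string]string{}
	for _, e := range log {
		switch e.Kind {
		case "node-added":
			state[e.Node] = "pending"
		case "node-started":
			state[e.Node] = "running"
		case "node-completed":
			state[e.Node] = "done"
		}
	}
	return state
}

func main() {
	log := []Event{
		{"node-added", "scout"},
		{"node-started", "scout"},
		{"node-completed", "scout"},
		{"node-added", "implement"}, // spawned mid-run: also just an event
	}
	state := replay(log)
	fmt.Println(state["scout"], state["implement"]) // done pending
}
```

Note the last event: a spawn is just another log entry, so graph growth gets the same durability and auditability as everything else.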
Workers are just processes that poll for tasks. Any language, any runtime. You register a worker with the activity types it can handle, then poll for tasks. When you get one, you run it and report back. The Go SDK is one file. A TypeScript SDK is straightforward to build on top of the gRPC API.
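The shape of a worker loop, sketched with invented names (the real SDK's API will differ, and an in-memory channel stands in here for polling the engine over gRPC):

```go
package main

import "fmt"

// Task is a unit of work handed to a worker.
type Task struct {
	ID           string
	ActivityType string
}

// Handler runs one task and returns its result.
type Handler func(Task) (string, error)

// runWorker drains the task source, dispatches each task to the handler
// registered for its activity type, and reports the outcome back.
func runWorker(tasks <-chan Task, handlers map[string]Handler, report func(taskID, result string)) {
	for t := range tasks {
		h, ok := handlers[t.ActivityType]
		if !ok {
			continue // this worker didn't register for that activity type
		}
		if out, err := h(t); err != nil {
			report(t.ID, "failed: "+err.Error())
		} else {
			report(t.ID, out)
		}
	}
}

func main() {
	tasks := make(chan Task, 2)
	tasks <- Task{ID: "t1", ActivityType: "scout"}
	tasks <- Task{ID: "t2", ActivityType: "reviewer"} // not handled by this worker
	close(tasks)

	handlers := map[string]Handler{
		"scout": func(t Task) (string, error) { return "analysis complete", nil },
	}
	runWorker(tasks, handlers, func(id, res string) {
		fmt.Println(id, res)
	})
}
```

Because the contract is just "poll, run, report", a TypeScript or Python worker is the same loop against the gRPC API, with no engine logic on the worker side.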
Compensation is built in. Each node can declare a compensation activity — what to undo if things go wrong downstream. When a run fails, Grael automatically runs compensations in reverse order. Saga pattern, out of the box.
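The reverse-order part is the whole trick, so here is a minimal sketch under assumed semantics (illustrative names, not Grael's API):

```go
package main

import "fmt"

// compensate runs the undo actions of already-completed nodes in
// reverse completion order. Nodes that declared no compensation are
// simply skipped.
func compensate(completedOrder []string, comps map[string]func()) {
	for i := len(completedOrder) - 1; i >= 0; i-- {
		if undo, ok := comps[completedOrder[i]]; ok {
			undo()
		}
	}
}

func main() {
	var ran []string
	comps := map[string]func(){
		"create-branch": func() { ran = append(ran, "delete-branch") },
		"open-pr":       func() { ran = append(ran, "close-pr") },
	}
	// "implement" completed too but declared no compensation.
	compensate([]string{"create-branch", "implement", "open-pr"}, comps)
	fmt.Println(ran) // [close-pr delete-branch]
}
```

One nice interaction with the living graph: spawned nodes declare compensations like any other node, so a run that grew mid-flight still unwinds cleanly.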
Human checkpoints are first-class. An activity can return a checkpoint signal instead of completing — the node enters a waiting state, unrelated work continues, and the run resumes when someone calls ApproveCheckpoint. The checkpoint timeout is configurable per node. This is how you put a human in the loop without halting the entire pipeline.
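One way to see why a checkpoint doesn't stall the run: readiness is computed per node from dependency states, so a waiting node blocks only its own descendants. A minimal sketch with invented names (Grael's scheduler internals may look nothing like this):

```go
package main

import (
	"fmt"
	"sort"
)

// runnable returns the pending nodes whose dependencies are all "done".
// A node parked at a checkpoint stays "waiting": not done, so its
// descendants are blocked, but every unrelated branch keeps scheduling.
func runnable(deps map[string][]string, state map[string]string) []string {
	var out []string
	for id, ds := range deps {
		if state[id] != "pending" {
			continue
		}
		ok := true
		for _, d := range ds {
			if state[d] != "done" {
				ok = false
			}
		}
		if ok {
			out = append(out, id)
		}
	}
	sort.Strings(out) // deterministic output for the demo
	return out
}

func main() {
	deps := map[string][]string{
		"approval": {},           // human gate
		"publish":  {"approval"}, // blocked until the gate opens
		"collect":  {},           // unrelated branch
		"metrics":  {"collect"},
	}
	state := map[string]string{
		"approval": "waiting", // checkpoint reached, not done
		"publish":  "pending",
		"collect":  "done",
		"metrics":  "pending",
	}
	fmt.Println(runnable(deps, state)) // [metrics]: the gate stalls only publish
}
```

When ApproveCheckpoint flips the gate to done, publish becomes runnable on the next pass; nothing else had to wait.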
Demo
The demo runs a morning incident briefing workflow. The scenario: an on-call team needs to quickly assemble a picture of what's happening and decide what to investigate. Here's the shape of the run:
- Three preparation steps start in parallel — collect customer escalations, pull checkout metrics, prepare the briefing outline.
- A planning step runs once those complete and decides which follow-up checks need to happen.
- Grael spawns the concrete investigation nodes based on that decision — verify checkout latency, confirm payment auth drop, review support spike. These weren't in the graph when the run started.
- One spawned investigation fails retryably. Grael retries it automatically.
- An editor approval gate opens. The run doesn't freeze — the other investigations keep progressing.
- Once all investigations are done and the approval comes through, the results flow into assembling the final brief.
- The brief is published. Run completes.
What to watch for: the graph growing after the planning step, multiple nodes running at the same time, the failed node recovering, and the approval gate that's clearly distinct from a stall.
One more thing worth noting: what you're watching is a replay. Not a live run recorded at demo time — a deterministic replay of a previously recorded execution, driven from the event log.
This is possible because of how Grael works internally. Every state transition — node started, node completed, spawn happened, retry scheduled, checkpoint reached — is written to an append-only WAL before anything changes in memory. The current state of a run is always derived by replaying that log from the beginning. There's no separate "current state" that can drift or get corrupted.
The consequence is that any run can be replayed exactly. Same events, same order, same graph shape, same outcome. That's what makes the demo reproducible — and it's also what makes crash recovery work. If Grael goes down mid-run, it replays the log on restart and picks up where it left off. The demo and the durability guarantee are the same mechanism.
To be clear: this is very early. The demo workflow is synthetic, built to demonstrate these behaviors in a controlled setting, and what you're seeing is a proof of concept, not a production system. I've run it through a handful of synthetic workflows to validate the architecture, but I haven't thrown real production workloads at it. There are rough edges, missing features, and exactly zero battle testing. The goal right now is to get the core idea right, not to ship something stable.
Why this matters for vexdo specifically
The current version of vexdo has its orchestration hardcoded. The pipeline is always: submit → review → arbiter → fix → repeat. That works for the use case I built it for, but it's not general enough to turn into a product.
With Grael underneath, vexdo becomes a set of activity workers — scout, executor, reviewer, arbiter, pr-creator — registered against an engine that handles the graph. The pipeline itself is just a starting node definition. What it grows into depends on what the agents discover.
This also unlocks things that were awkward before: running review and security checks in parallel, spawning additional investigation steps when something looks risky, human approval gates at specific points in the pipeline. These become configuration, not code changes.
What's next
Grael needs a gRPC server layer before it can talk to TypeScript workers. That's the immediate next step. After that, the TypeScript SDK, then wiring vexdo's agents into it.
If this is interesting to you — either because you're building something similar, or because you're skeptical this is the right abstraction — I'd genuinely like to hear it. The living graph idea feels right to me, but I'm very much still figuring out where it breaks.
More posts coming as this progresses.