Arkadiusz Przychocki

Posted on • Originally published at blog.arkstack.dev

Where StructuredTaskScope Ends: Building the Flow Layer in Exeris

This is the fifth article in the Exeris Kernel series.

The series has been building one coherent architectural picture: article one replaced
ThreadLocal with ScopedValue on the context propagation path; article two argued that
some layers do not benefit from runtime polymorphism; article three showed where
StructuredTaskScope genuinely earns its keep; article four pushed TLS off the heap entirely.

This one is about where STS stops being the right tool — and what had to be built instead.


The Constraint I Kept Hitting

The pressure came from a specific place: I was building the orchestration layer for
Exeris Kernel — the part that drives multi-step business processes with compensation
and external event wiring.

The obvious starting point was an existing enterprise saga framework. I considered
Camunda, Axon, and several others in that category, and rejected all of them. The
flexibility the kernel's other constraints demanded — zero-allocation hot paths,
ScopedValue propagation, off-heap state — was not something I could retrofit onto a
framework whose orchestration model was already opinionated. Each also meant a separate
operational process — Axon Server, Camunda engine — with its own JVM, memory footprint,
and lifecycle. The cost-efficiency dimension I only quantified properly later, but
the operational overhead was visible from day one.

So I went one level lower. At first, StructuredTaskScope seemed like a natural fit.
You fork work, you join it, you get structured lifecycle guarantees. Project Loom made
this cheap and clean.

Then I hit a step that needed to wait for a payment confirmation.

Not for a few milliseconds. Not until a downstream service responds synchronously.
The step needed to yield execution, persist its state, and resume when an external
event arrived — possibly hours later, possibly after a JVM restart.

StructuredTaskScope has no model for this. It is structured in space — all forked
tasks are bounded to a scope that lives on a single call stack. It is not structured
in time. The moment your execution boundary needs to span time rather than scope,
you have left StructuredTaskScope territory entirely.

That is the constraint that decided the architecture.


What STS Still Does Inside Flow

Before going further: StructuredTaskScope is not replaced. It is still used in
several places throughout the kernel — and not always for what most introductions
to STS show.

The first use is the obvious one: inside a single Flow step, an action sometimes needs
to call two independent services and merge the results before returning
FlowOutcome.CONTINUE. The execution is bounded to that step's Virtual Thread.
STS owns the fan-out; Flow owns the step boundary. Standard structured concurrency.
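A minimal sketch of that first pattern, assuming the open()/fork()/join() API shape the
article references later; the service clients and the FlowContext accessors
(inventoryClient, pricingClient, ctx.put) are illustrative names, not Exeris API:

// Hedged sketch of in-step fan-out. inventoryClient, pricingClient and ctx.put(...)
// are illustrative, not part of the Exeris API.
FlowOutcome reserveAndPrice(FlowContext ctx) throws InterruptedException {
    try (var scope = StructuredTaskScope.open()) {
        var reservation = scope.fork(() -> inventoryClient.reserve(ctx));
        var quote       = scope.fork(() -> pricingClient.quote(ctx));
        scope.join();                          // both succeed, or join() throws
        ctx.put("reservation", reservation.get());
        ctx.put("quote", quote.get());
    }
    return FlowOutcome.CONTINUE;               // step done, same Virtual Thread continues
}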

The second is at L3, inside the Events subsystem. InMemoryEventBus.publishAndAwait
opens a scope, forks one Virtual Thread per registered handler, joins, and returns
once every handler has finished. CommunityEventLoop.dispatchBatch does the same
shape on a different cadence: per drained batch of events, fork one VT per registered
batch processor, join when all are done. Both are clean ad-hoc fan-outs with a
deterministic join point — exactly the case STS was designed for.

The third is the most subtle. OutboxOrchestrator.ownerLoop opens a scope and forks
exactly one task — its long-lived poll-and-flush loop. Why use STS for a single fork?
Because Java 26 enforces an owner-thread rule: open(), fork(), join(), and
close() must all happen on the same thread. The orchestrator spawns a dedicated
owner Virtual Thread specifically to satisfy that constraint, so the lifecycle scope
of the loop is explicit and structured rather than implicit and tracked across threads.
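A sketch of that owner-thread shape, where pollAndFlushLoop() stands in for the
orchestrator's actual loop; only the structure (dedicated owner Virtual Thread, single
fork, owner-side join) mirrors the description above:

// Sketch of the owner-thread pattern: one dedicated Virtual Thread opens the scope,
// forks the long-lived loop, and joins on it. pollAndFlushLoop() is a stand-in name.
Thread owner = Thread.ofVirtual().name("outbox-owner").start(() -> {
    try (var scope = StructuredTaskScope.open()) {
        scope.fork(() -> { pollAndFlushLoop(); return null; });
        scope.join();                               // blocks until the loop exits
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();         // shutdown path: restore interrupt status
    }
});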

The point where Flow takes over is the inter-step boundary — the moment a step
returns something other than "I am done, continue." Below that boundary, STS
remains the right tool, in all three patterns above. Flow picks up where the
call stack ends.


The Execution Model

The execution model came from somewhere I didn't expect. Having ruled out the
enterprise saga frameworks and dropped one level lower into STS, I still needed
to decide what the saga state machine itself should look like. The first analogy
that came to mind was the TLS state machine I'd just been working on (described
in article four). TLS is one
of the most rigorously specified state machines in widely-deployed software —
states explicit, transitions deterministic, I/O bounded to state boundaries.
Saga orchestration, structurally, has the same shape. I tried imitating the TLS
pattern. It fit, and it stayed.

Each Flow instance runs on its own Virtual Thread, launched by CoreFlowRuntime:

// From CoreFlowRuntime.launch()
Thread thread = Thread.ofVirtual()
    .name("exeris-flow-"
          + instance.key().instanceIdMost()
          + '-'
          + instance.key().instanceIdLeast())
    .unstarted(() -> runInstance(instance, startStep));
runningThreads.add(thread);
thread.start();

That Virtual Thread executes steps in a loop until one of three things happens:

// From CoreFlowRuntime.applyOutcome()
return switch (outcome) {
    case CONTINUE -> applyContinueOutcome(instance, step, stepIndex);
    case COMPLETE -> {
        applyCompleteOutcome(instance, step);
        yield -1;
    }
    case PARK -> {
        applyParkOutcome(instance, stepIndex);
        yield -1;
    }
    default -> -1;
};

CONTINUE — advance to the next step, stay on the same Virtual Thread.
COMPLETE — short-circuit directly to terminal state, bypassing remaining steps.
PARK — this is the boundary. The Virtual Thread exits. State is persisted. Execution
is suspended until an explicit wake() call.

Figure 1: FlowState transitions and the Virtual Thread boundary.

That PARK path is exactly what has no equivalent in StructuredTaskScope. When
applyParkOutcome runs, it serializes the current FlowSnapshot — step index,
compensation stack, state, timeout — and registers the instance in the parked map.
A wake() call later resolves that snapshot, either from the in-memory parked map or from
the FlowSnapshotStore, and launches a new Virtual Thread that resumes from
currentStep + 1.
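Stated as code, the handoff looks roughly like the sketch below. This is a paraphrase of
the behavior just described, not the actual CoreFlowRuntime source; parkedInstances,
snapshotStore and launch are assumed names.

// Hedged paraphrase of the park/wake handoff; names beyond those quoted in the
// article (parkedInstances, snapshotStore, launch) are assumptions.
void applyParkOutcome(RuntimeFlowInstance instance, int stepIndex) {
    FlowSnapshot snapshot = instance.toSnapshot();         // step index, compensation stack, state, timeout
    snapshotStore.save(snapshot);                          // durable checkpoint
    parkedInstances.put(instance.key(), snapshot);         // in-memory wake index
    // the owning Virtual Thread now unwinds and exits; nothing blocks while parked
}

void wake(FlowKey key) {
    FlowSnapshot snapshot = parkedInstances.remove(key);
    if (snapshot == null) {
        snapshot = snapshotStore.load(key).orElse(null);   // fall back to the durable store
        if (snapshot == null) {
            return;                                        // unknown or already woken: no-op
        }
    }
    launch(snapshot, snapshot.currentStep() + 1);          // fresh Virtual Thread resumes at the next step
}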


Defining a Flow

Before a flow can run, it must be compiled into an execution plan. The builder API
is intentionally explicit — you declare steps, their compensation actions, and
any non-linear transitions:

FlowExecutionPlan orderFulfillment = engine.plans()
    .newDefinition("order-fulfillment")
    .step("validate-order",
          ctx -> {
              validateOrder(ctx);
              return FlowOutcome.CONTINUE;
          },
          ctx -> { /* no rollback needed for validation */ })
    .step("charge-payment",
          ctx -> {
              initiatePayment(ctx);
              return FlowOutcome.PARK; // wait for async confirmation
          },
          ctx -> refundPayment(ctx))
    .step("ship-order",
          ctx -> {
              dispatchShipment(ctx);
              return FlowOutcome.COMPLETE;
          },
          null)
    .build();

engine.plans().compile(orderFulfillment);

compile() validates the step graph, builds the adjacency matrix and the nextStep
index array, and registers the plan in the planCatalog. After this point, the engine
can schedule instances of this definition.

The trade-off here is deliberate: the definition is closed-world at compile time.
Adding or reordering steps after in-flight instances exist is a schema migration problem
— and the engine does not solve it silently. If an in-flight Saga was parked on a step
that no longer exists after a redeployment, EX-FLOW-7002 with SCHEMA_MISMATCH is
thrown on wake. The safe pattern is blue/green with Saga drain before switching traffic.
I kept this constraint explicit rather than hiding it behind version-transparent routing,
because transparent versioning that does not actually guarantee correctness is worse than
visible friction.


When the Wake Comes from Outside

Park/wake with a direct scheduler.wake() call covers the case where you control both
sides. The harder case is choreography: the flow should resume when an event arrives
from an external system — a payment gateway, a warehouse service, a user action.

This is where FlowChoreographyBridge and the sealed ChoreographyDecision
type come in. You register a mapper that translates incoming event descriptors
into routing decisions:

engine.registerChoreographyMapper(
    descriptor -> {
        // EventDescriptor carries the flow instance ID in streamIdHigh/streamIdLow.
        // The convention: the event producer encodes the target instance UUID
        // in those fields when publishing a choreography trigger.
        long instanceMost = descriptor.streamIdHigh();
        long instanceLeast = descriptor.streamIdLow();
        if (instanceMost == 0 && instanceLeast == 0) {
            return ChoreographyDecision.ignore();
        }
        return ChoreographyDecision.wake(instanceMost, instanceLeast);
    },
    List.of("payment.confirmed"),
    eventBus
);

The bridge implementation uses Java 21 pattern matching on the sealed hierarchy to
dispatch without instanceof chains:

// From FlowChoreographyBridge.handle()
switch (decision) {
    case ChoreographyDecision.Wake(long most, long least) -> {
        scheduler.lookupParked(most, least).ifPresent(scheduler::wake);
        // Stale or duplicate wake event: idempotent no-op if instance no longer parked
    }
    case ChoreographyDecision.Start(FlowExecutionPlan plan, long most, long least) -> {
        scheduler.schedule(plan, newContext(plan, most, least));
    }
    case ChoreographyDecision.Ignore() -> { /* intentional no-op */ }
}

The ChoreographyDecision sealed interface gives the compiler exhaustiveness guarantees.
Adding a new decision variant without handling it in the bridge is a compile error,
not a silent runtime miss. This is the same pattern applied to FlowOutcome in the
step execution loop — sealed types as an architectural fence.

The mapper itself is a @FunctionalInterface. It closes over whatever correlation
state you need. The bridge does not prescribe how you resolve correlation IDs —
that is your domain logic.
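For example, a mapper that owns its own correlation table might look like the sketch
below; paymentRefToInstance and the convention of carrying the payment reference in
streamIdHigh are application-side assumptions, not kernel behavior.

// Hedged sketch: correlation resolution is domain logic the mapper closes over.
// paymentRefToInstance and the streamIdHigh convention are application-side assumptions.
Map<Long, FlowKey> paymentRefToInstance = new ConcurrentHashMap<>();

FlowChoreographyMapper byPaymentRef = descriptor -> {
    FlowKey key = paymentRefToInstance.get(descriptor.streamIdHigh());
    if (key == null) {
        return ChoreographyDecision.ignore();               // not one of ours
    }
    return ChoreographyDecision.wake(key.instanceIdMost(), key.instanceIdLeast());
};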


Events and Flow: Two Layers, One Deduplication Contract

The Events subsystem (L3) and the Flow engine (L4) have an explicit deduplication split
that is worth stating directly, because it is easy to assume one layer handles it all.

EventBus delivers at-least-once. There is no built-in deduplication at the bus level
— that is an intentional design decision. Built-in bus-level dedup requires shared state
across all subscribers: either a heap-allocated ConcurrentHashMap (GC pressure) or a
distributed lock (latency). Neither is acceptable at the performance tier the Events
subsystem targets.

IdempotencyGuard at L4 covers the Flow-specific case: step-level dedup per Saga instance.
The split is:

Layer            | Deduplication          | Mechanism
EventBus (L3)    | None (at-least-once)   | Subscriber responsibility
Flow Engine (L4) | Per step, per instance | IdempotencyGuard.tryClaimStep()
Application      | Per state mutation     | EventDescriptor.eventUuidHigh/Low as dedup key

This means a choreography wake event can arrive twice — Outbox retry, broker reconnect,
network duplicate. The bridge calls scheduler.lookupParked() then scheduler.wake().
If the flow is no longer parked (already woken by the first delivery), lookupParked()
returns empty and the second wake is a silent no-op. If the flow wakes and re-enters
a step it already executed, IdempotencyGuard skips it and advances. Two safety nets,
neither of which requires coordination across the two layers.
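The Application row of the table is the one the kernel leaves entirely to the caller. A
minimal sketch, assuming illustrative names (processedEvents, applyMutation) alongside
the eventUuidHigh/Low fields named above:

// Hedged sketch of application-level dedup keyed on the event UUID pair.
// processedEvents and applyMutation are illustrative, not kernel API.
Set<UUID> processedEvents = ConcurrentHashMap.newKeySet();

void onEvent(EventDescriptor descriptor) {
    UUID eventId = new UUID(descriptor.eventUuidHigh(), descriptor.eventUuidLow());
    if (!processedEvents.add(eventId)) {
        return;                                  // duplicate delivery from the at-least-once bus
    }
    applyMutation(descriptor);                   // the actual state mutation, once per event
}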

The integration point between them is FlowProgressPublisher: when a flow reaches a
terminal state, it optionally publishes a FlowProgress event to the EventBus. Only
terminal transitions emit — intermediate state changes are deliberately skipped to avoid
allocation on the hot scheduling path.


Idempotency Under Replay

Crash-recovery replay and choreography re-wakes create a real deduplication problem:
a step that already executed successfully should not execute again if the engine
restarts mid-flow.

IdempotencyGuard handles this at the step level:

// From CoreFlowRuntime.runStep() — called inside synchronized(instance.monitor())
if (guard != null && !guard.tryClaimStep(
        instance.key().instanceIdMost(),
        instance.key().instanceIdLeast(),
        stepIndex)) {
    // Step already claimed — skip and advance
    int nextGuardIndex = instance.plan().nextStep(stepIndex);
    if (nextGuardIndex < 0) {
        complete(instance);
        return -1;
    }
    return nextGuardIndex;
}

The default CoreIdempotencyGuard is heap-backed: a ConcurrentMap<FlowKey, ConcurrentMap<Integer, Boolean>>.
The inner map keyed by step index makes releaseInstance() an O(1) removal of the
entire instance entry rather than O(claimed steps). On complete() or
FAILED_ROLLEDBACK, the guard releases the instance claim.
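Written out, that shape is roughly the sketch below; the method signatures follow the
calls shown in runStep(), while the FlowKey constructor and the exact SPI surface are
assumptions.

// Hedged sketch mirroring the guard surface used in runStep(); the real
// CoreIdempotencyGuard, the SPI signatures and the FlowKey constructor may differ.
final class HeapIdempotencyGuard {
    private final ConcurrentMap<FlowKey, ConcurrentMap<Integer, Boolean>> claims =
            new ConcurrentHashMap<>();

    boolean tryClaimStep(long instanceIdMost, long instanceIdLeast, int stepIndex) {
        var perInstance = claims.computeIfAbsent(
                new FlowKey(instanceIdMost, instanceIdLeast), k -> new ConcurrentHashMap<>());
        return perInstance.putIfAbsent(stepIndex, Boolean.TRUE) == null;   // first claim wins
    }

    void releaseInstance(long instanceIdMost, long instanceIdLeast) {
        claims.remove(new FlowKey(instanceIdMost, instanceIdLeast));       // O(1) per instance
    }
}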

Custom IdempotencyGuard implementations — backed by Redis, a database, or
any other shared store — can be bound via KernelProviders.IDEMPOTENCY_GUARD
before submitting to the engine.

In the upcoming distributed model (ADR-013), idempotency becomes two-layered:
the in-memory heap CAS in CoreIdempotencyGuard, and the durable CAS on
FlowSnapshot.schemaVersion in JdbcFlowSnapshotStore. ADR-013 requires that
these two layers agree on terminal states — if the heap guard reports "already done"
but the durable row claims otherwise, the durable answer wins and heap state is
reconciled on next load. Neither layer can be the sole source of truth in a
distributed deployment.

This is not universal. It only covers the cases where the execution path is still
under the kernel's control. External side effects — the payment was initiated but
the response was lost — remain the caller's responsibility to make idempotent.
The guard prevents the kernel from re-executing a step it already claimed. It cannot
prevent a downstream service from seeing a duplicate call if the step succeeded before
the crash.


What Still Remains True

Compensation is powerful but it comes with a cost: every step that can be rolled back
must be written as an idempotent inverse operation. If your compensation action has
side effects that cannot be reversed — a notification already sent, an audit record
already written — you are modeling a business invariant that compensation cannot
satisfy. The engine provides the mechanism. The correctness is still yours to own.

The terminalStateCatalog in CoreFlowRuntime grows from start() until close().
It is an in-process idempotency fence that prevents re-scheduling already-terminal flows
within a single runtime lifetime. This is acceptable for the current operational model
but it is a known constraint: very long-lived runtimes with high flow volume will see
this map accumulate. Bounded retention is on the v0.7 roadmap — but any policy
must preserve the fence semantics. A terminal flow must not be re-executable.

Cross-restart choreography wake is not yet supported. After a restart, parked flows
survive only as snapshots in FlowSnapshotStore; lookupParked() currently resolves
only from the in-memory index. The v0.7 roadmap closes this via JdbcFlowSnapshotStore
with a parked-enumeration entry point — on startup the engine rehydrates wake routing
from the durable store. Steady-state choreography keeps the in-memory O(1) fast path;
the store probe is fallback-only on miss.

One constraint from that upcoming JDBC implementation is worth noting here: JdbcFlowSnapshotStore
is explicitly prohibited from using ThreadLocal for context propagation. DataSource
is constructor-injected through BootstrapContext. The same constraint that banned
ThreadLocal from the kernel's context propagation layer runs all the way through to the
distributed saga state layer.

Panama FFM does not appear in this layer directly. The allocation constraint is
enforced on the hot scheduling path via FlowZeroAllocTck, but orchestration state
lives on the heap. The off-heap ownership model from the transport layer is not the
right model for a state machine that needs to checkpoint, restore, and evolve
across JVM restarts.

One place where the zero-allocation philosophy does reach into this layer is
EventDescriptor — the routing metadata used by both the Events subsystem and
the choreography bridge. Seven primitive fields: two long pairs for the event
and stream UUIDs, two ints for the ordinal and the flags bitmap, and one long for the
timestamp. The wire codec packs these into exactly 64 bytes — one CPU cache line,
with explicit MemoryLayout padding to fill it. C2 scalarises the record via
Escape Analysis today. When JEP 401 reaches GA, adding the value modifier
requires zero field changes — object headers disappear for free.
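Expressed as an FFM layout, that packing could look like the sketch below; the field
order follows the description above, not the kernel source, and JAVA_LONG/JAVA_INT come
from java.lang.foreign.ValueLayout.

// Hedged sketch of a 64-byte, cache-line-sized layout for the fields described above;
// the kernel's actual field order and names may differ.
static final MemoryLayout EVENT_DESCRIPTOR_LAYOUT = MemoryLayout.structLayout(
        JAVA_LONG.withName("eventUuidHigh"),
        JAVA_LONG.withName("eventUuidLow"),
        JAVA_LONG.withName("streamIdHigh"),
        JAVA_LONG.withName("streamIdLow"),
        JAVA_LONG.withName("timestamp"),
        JAVA_INT.withName("ordinal"),
        JAVA_INT.withName("flags"),
        MemoryLayout.paddingLayout(16)            // 48 payload bytes padded to one 64-byte cache line
).withName("EventDescriptor");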


Conclusion

StructuredTaskScope is structured in space; Flow is structured in time. The
distinction only becomes visible at the inter-step boundary, where FlowOutcome.PARK
exits the Virtual Thread, persists state, and lets a different Virtual Thread resume
later. No STS abstraction maps to that shape. Choreography handles the case where
the wake comes from outside — sealed ChoreographyDecision and FlowChoreographyMapper
make the event-driven path type-safe and compiler-exhaustive, so adding a new decision
variant is a compile error rather than a silent runtime miss.

The trade-offs stay visible: schema migration requires drain-and-switch, compensation
correctness is the caller's responsibility, and the terminal state catalog is
runtime-scoped. None of these are problems Flow pretends to solve transparently.

The next concrete step is distributed saga state, decided in ADR-013. The model:
a shared durable FlowSnapshotStore (reference implementation: JdbcFlowSnapshotStore
over Postgres in v0.7 Sprint 3) as the source of truth, with Kafka choreography wake
for cross-service saga recovery (Sprint 5/6). Optimistic concurrency via
FlowSnapshot.schemaVersion — already in the SPI with SCHEMA_VERSION_INITIAL = 1L
— resolves concurrent advance attempts from two kernel nodes without a coordinator.
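At the store level, that optimistic CAS is an ordinary conditional update. A sketch under
assumed table and column names (flow_snapshot, schema_version) and assumed FlowSnapshot
accessors, not the actual JdbcFlowSnapshotStore schema:

// Hedged sketch of the optimistic CAS; the table, its columns, serialize() and the
// FlowSnapshot accessors are assumptions, not the actual JdbcFlowSnapshotStore API.
boolean tryAdvance(Connection c, FlowSnapshot next, long expectedVersion) throws SQLException {
    String sql = "UPDATE flow_snapshot "
               + "SET payload = ?, schema_version = ? "
               + "WHERE instance_id_most = ? AND instance_id_least = ? "
               + "AND schema_version = ?";
    try (PreparedStatement stmt = c.prepareStatement(sql)) {
        stmt.setBytes(1, serialize(next));
        stmt.setLong(2, expectedVersion + 1);
        stmt.setLong(3, next.instanceIdMost());
        stmt.setLong(4, next.instanceIdLeast());
        stmt.setLong(5, expectedVersion);
        return stmt.executeUpdate() == 1;   // 0 rows: another node advanced first; reload and retry
    }
}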

Three approaches were rejected before landing there. A distributed lock service
(ZooKeeper/etcd) puts blocking coordination on the saga advance path — inconsistent
with the No-Waste-Compute contract. CRDT-based state replication is semantically
unsound for the saga model: the compensation stack and step ordering are not
commutative, so concurrent advances cannot be safely merged at the data-structure level.
A single-leader coordinator with Raft/Paxos reintroduces the centralized stateful
component the kernel deliberately avoids.

What remains: cross-service choreography wake, parked-instance enumeration on startup,
and schemaVersion wire-through in RuntimeFlowInstance.toSnapshot(). The schema
migration constraint (drain-and-switch) is a separate problem that distributed saga
state does not soften — that requires explicit FlowSnapshot definition versioning,
which is scoped to a later milestone.


Explore the Exeris Kernel — zero-allocation architecture in running code:
🔗 exeris-systems/exeris-kernel

The Flow subsystem is in exeris-kernel-core and exeris-kernel-spi. The TCK
covering the full lifecycle — submit, park, wake, compensate, crash-recovery — is
in exeris-kernel-tck.
