Arkadiusz Przychocki

Originally published at blog.arkstack.dev

What you measure depends on where you draw the boundary

Benchmark metadata — scenario: e2e-shop-order-saga · hardware: dev-laptop · jdk: 26 · date: 2026-05-05 · status: descriptive · reproducibility: complete

The previous article in this series — Where StructuredTaskScope Ends — argued the architectural case for building a native Flow engine instead of adopting an existing saga framework. This is the empirical sequel: what those frameworks actually cost when you measure all three saga guarantees, not just forward progress.

I ran the comparative benchmark expecting latency differences. Spring + Axon did post the fastest p95 — but that surprised me less than something else in the same table. Under matched 3% failure injection, both Axon-based stacks reported zero compensations. Not "fewer than expected." Zero. That's what made me stop and look harder.

Saga frameworks are evaluated on throughput and latency. That's
incomplete, and sometimes misleading. A saga makes three
guarantees: forward progress, compensation under failure, and
termination. Most benchmarks measure forward progress only — and
then call the framework "correct" if the happy path works.

Here's what happens when you measure all three.

The setup

Three saga frameworks. One identical scenario contract. Same
payload, same VU count, same think time, same payment_fail_rate
(3% configured failure injection), same JDK 26, same dev-laptop
hardware, same wire protocol (HTTP/1.1, loopback — all three apps
were configured for h2c, but k6 negotiated HTTP/1.1 in every run;
the protocol was identical across stacks regardless).

| Stack | Orchestration model |
| --- | --- |
| Exeris (open-core) | Native Flow engine — synchronous saga execution, in-process state machine |
| Quarkus 3 + Axon Framework + Neo4j | Async event-sourced saga via Axon Server (separate process) |
| Spring Boot 3 + Axon Framework + Neo4j | Same Axon model, different host runtime |

Difference: orchestration model, not problem domain. Same scenario
contract on every side.

Two terms used throughout this article need explicit definitions.
dev-laptop here is an AMD Ryzen 5 5600 (6 cores / 12 threads),
32 GB DDR4, running Linux with a full graphical desktop
environment, with all benchmark components — k6, the three target
applications, and Axon Server where applicable — co-located on
loopback. perf-box-amd64, the next milestone for comparative
claims, is bare metal with a real WAN link between the load
generator and the target application. The CPU may or may not be
faster than the dev-laptop — that isn't the point. The point of
the perf-box re-run is removing the localhost shortcut and the
desktop-environment noise, and exposing the saga path to honest
network latency and operational shape.

The benchmark contract — e2e-shop-order-saga — exercises a
five-step order saga: register customer → recommend products → add
to cart → get cart → create order. Payment failure is injected
at 3% rate. Compensation must roll back stock reservation and
release the order.

Reproducibility metadata lives in the published result.json for
every run:

{
  "scenario_id": "e2e-shop-order-saga",
  "contract_id": "exeris_community_h2c_v1",
  "hardware_profile": "dev-laptop",
  "jdk_version": "26",
  "claim_scope": "exploratory",
  "reproducibility_status": "complete"
}

Every published number carries claim_scope, hardware_profile,
transport_mode, and reproducibility_status. Hard fairness
gates reject runs that fail equivalence checks. Comparative claims
require matched scenario contract on both sides — everything else
is labeled descriptive_only or exploratory. This is how every
claim below was evaluated, including my own.

→ scenario contract
→ raw artifacts

The numbers that look reasonable until you read them

| Metric | Exeris | Quarkus | Spring |
| --- | --- | --- | --- |
| http_req_duration p50 | 4.59 ms | 4.18 ms | 3.10 ms |
| http_req_duration p95 | 16.6 ms | 30.3 ms | 14.6 ms |
| iteration_duration p95 | 9.18 s | 8.49 s | 9.28 s |
| Saga success rate | 96.7% | 98.2% | 98.8% |
| Compensation rate (cfg: 3%) | 3.32% ✓ | 0% ✗ | 0% ✗ |
| Saga unresolved rate | 0% ✓ | 1.82% ✗ | 1.22% ✗ |
| App peak RSS | 459 MB | 752 MB | 1,312 MB |
| App peak threads | 66 | 94 | 81 |
| App CPU time | 24.7 s | 22.3 s | 33.3 s |
| Axon Server RSS (separate proc) | n/a | ~787 MB | ~850 MB |
| Axon Server PIDs | n/a | ~120-126 | ~133-148 |

Look at the top of that table. Spring Boot 3 + Axon shows the
fastest p50 and p95. Quarkus and Exeris are in the same band.
By the standard mental model — the one most performance posts
are written under — Spring wins.

Now look at the bottom.

Both Axon-based stacks ran the same configured 3% payment
failure rate as Exeris. Exeris compensated 3.32% of sagas,
matching the configured rate within statistical noise. Both
Quarkus and Spring Boot reported 0% compensations.

Where did the failures go? Into "saga unresolved" — sagas that
never reached a terminal state before the test window closed.
1.82% for Quarkus, 1.22% for Spring.

This isn't a performance gap. It's a correctness gap. And the two
Axon-based stacks share it for the same reason.

What you measure depends on where you draw the boundary

Insight

The dropped compensations and the "fast" latency are not two facts.
They are one fact: a framework that returns before work is done
will always show both.

Here's what's structurally happening.

Exeris-native Flow runs the saga inline. The order endpoint
returns only once the saga state machine has reached a terminal
state — when a compensation has actually executed, the response
carries that fact. The 16.6 ms p95 you see in the table measures
actual saga progress, including compensation work on the failure
paths.

Spring + Axon dispatches the order command to Axon Server
asynchronously. The endpoint returns 202 Accepted as soon as the
event is published. The saga state machine continues running in
a separate process, with no one waiting for it on the request
path. The 14.6 ms p95 measures the time to publish to Axon
Server. Not the time to do the work.

Quarkus + Axon has the same architecture. Same async dispatch.
Same illusion.
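
In Axon terms, the boundary is visible at the gateway call. Here's
a minimal sketch of the async dispatch shape both Axon stacks use
(hypothetical endpoint and command types, not code from the
benchmark repo):

import org.axonframework.commandhandling.gateway.CommandGateway;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
class OrderController {

    private final CommandGateway gateway;

    OrderController(CommandGateway gateway) {
        this.gateway = gateway;
    }

    @PostMapping("/orders")
    ResponseEntity<Void> create(@RequestBody CreateOrderCommand cmd) {
        // Async dispatch: send() returns as soon as the command is
        // handed off. The saga (and any compensation) runs later, in
        // a separate process. This dispatch time is what a 14.6 ms
        // p95 measures. Even the blocking gateway.sendAndWait(cmd)
        // only waits for the command handler, not for the saga's
        // downstream event-driven steps.
        gateway.send(cmd);
        return ResponseEntity.accepted().build(); // 202 before the work is done
    }
}

record CreateOrderCommand(String orderId, String customerId) {}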

This is also exactly why both Axon-based stacks drop
compensations. The saga hasn't finished by the time the k6 test
window closes. It's still running in Axon Server, async, with
no one polling. The benchmark observes "0% compensations" because
the compensation work hasn't run yet, not because no failures
occurred. The 1.2–1.8% of sagas left in a non-terminal state are
the fingerprint.

The same boundary problem applies to every other resource on
that table:

Memory:
  Spring app 1.3 GB     +  Axon Server ~850 MB  =  ~2.16 GB
  Quarkus app 752 MB    +  Axon Server ~787 MB  =  ~1.54 GB
  Exeris all-in 459 MB                          =     459 MB

Threads:
  Spring app 81         +  Axon Server ~140    =  ~221
  Quarkus app 94        +  Axon Server ~123    =  ~217
  Exeris all-in                                 =    66

CPU time:
  Spring app 33 s       +  Axon Server ~52 s   =  ~85 s
  Quarkus app 22 s      +  Axon Server ~26 s   =  ~48 s
  Exeris all-in                                 =    25 s

Each of those metrics tells the same story. Draw the boundary
tightly around the application process and everything looks
reasonable. Draw it around "what does it take to run a saga
end-to-end?" and every metric is dominated by what's outside the
application process.

This is the real architectural cost of async dispatch: not just
the dropped compensations, but the second process you have to
feed CPU, RAM, threads, network, and operational attention.
Exeris-native Flow chose the in-process path. That choice is
also why it correctly compensates.

Two host runtimes, same defect

Two unrelated host runtimes (Quarkus 3 and Spring Boot 3) share
the same correctness defect. That rules out host-runtime quirks.
The shared cause is Axon's event-sourced async dispatch model
colliding with a benchmark window of finite length.

Axon's SubscribingEventProcessor (the variant used here) processes
events on a thread distinct from the dispatcher. Saga state
transitions land on that processor thread. Status reads — what k6
polls to determine "did this saga complete?" — go through Axon's
in-memory ConcurrentHashMap projection, which is updated only
after the event handler runs.

When the benchmark window ends, in-flight events that haven't
been processed are still in the queue. The saga's state never
reaches COMPLETED or COMPENSATED from the test's vantage point.
k6 sees a non-terminal state and counts it as saga_unresolved.
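
A minimal sketch of that projection pattern makes the mechanism
concrete (class and event names are hypothetical, not taken from
the benchmark repo):

import org.axonframework.eventhandling.EventHandler;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SagaStatusProjection {

    private final Map<String, String> statusBySagaId = new ConcurrentHashMap<>();

    @EventHandler
    void on(OrderCompletedEvent event) {
        statusBySagaId.put(event.sagaId(), "COMPLETED");
    }

    @EventHandler
    void on(OrderCompensatedEvent event) {
        statusBySagaId.put(event.sagaId(), "COMPENSATED");
    }

    // What the k6 status poll ultimately reads. If the event handler
    // hasn't run by the time the window closes, this stays
    // non-terminal and the saga is counted as saga_unresolved.
    String status(String sagaId) {
        return statusBySagaId.getOrDefault(sagaId, "IN_PROGRESS");
    }
}

record OrderCompletedEvent(String sagaId) {}
record OrderCompensatedEvent(String sagaId) {}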

I want to be clear about what this is and isn't. It's not a bug
in Axon — it's the design. Async event-sourced sagas are designed
to eventually reach terminal state, given enough time. The
benchmark window — 180 seconds — is finite. The saga's
"eventually" doesn't fit in that window for some percentage of
failed sagas.

Exeris-native Flow has no such gap because the saga state machine
runs synchronously on the request virtual thread (VT), with
off-heap state persisted before the HTTP response. When the
response says "saga complete", the saga is complete. When the
response says "saga compensated", the compensation has executed.

You can verify this by inspecting the JFR snapshots that
accompany every benchmark run. The Exeris JFR shows compensation
events on the request VT timeline, completing within the request.
The Axon JFRs show event handler invocations that don't always
complete before the request VT exits.
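
You don't need a GUI for that check: the JDK's jdk.jfr.consumer
API can scan a snapshot directly. A small sketch, with the caveat
that the event-name filter below is an assumption rather than the
repo's actual custom event names:

import java.nio.file.Path;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Prints events from a benchmark JFR snapshot so you can see
// whether compensation-related events complete inside the request
// timeline.
public class JfrScan {
    public static void main(String[] args) throws Exception {
        for (RecordedEvent event : RecordingFile.readAllEvents(Path.of(args[0]))) {
            String name = event.getEventType().getName();
            // Filter is an assumption: adjust to the recording's
            // actual custom event names.
            if (name.contains("Compensation") || name.contains("Saga")) {
                System.out.println(event.getStartTime() + "  " + name
                        + "  dur=" + event.getDuration().toMillis() + " ms");
            }
        }
    }
}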

Where structured concurrency helps — and where it stops

A saga isn't a fork-join problem. STS (StructuredTaskScope) gives
you bounded fork-and-join, structured cancellation, and explicit
error propagation. Useful primitives, but not enough on their own:

  • STS does not give you state persistence between handler invocations
  • STS does not give you compensation-queue durability across crashes
  • STS does not give you idempotency for retry-after-restart

Exeris-native Flow combines STS for in-flight orchestration with
an off-heap flow state machine and an outbox for crash-safe
compensation. STS is half the saga story, not the whole one.
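
For the in-flight half, the shape looks roughly like this. The API
shown is the StructuredTaskScope preview as it stood in JDK 21–24
(ShutdownOnFailure); later JDKs reshape this API, so treat it as a
sketch of the pattern, not the exact JDK 26 code. Service and
result types are hypothetical:

import java.util.concurrent.ExecutionException;
import java.util.concurrent.StructuredTaskScope;

class SagaStepOrchestration {

    StepResult reserveAndAuthorize(Order order, Inventory inventory, Payments payments)
            throws InterruptedException, ExecutionException {
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            var stock   = scope.fork(() -> inventory.reserve(order));
            var payment = scope.fork(() -> payments.authorize(order));

            scope.join();           // bounded join: both subtasks settled
            scope.throwIfFailed();  // propagate first failure; sibling cancelled

            // What STS does NOT do: persist this step's outcome, make
            // the compensation durable, or deduplicate a retry after a
            // crash. That's the off-heap state machine and outbox's job.
            return new StepResult(stock.get(), payment.get());
        }
    }

    record Order(String id) {}
    record StepResult(Object reservation, Object authorization) {}

    interface Inventory { Object reserve(Order o); }
    interface Payments  { Object authorize(Order o); }
}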

The Axon model is the inverse: it solves persistence and
durability (Axon Server holds events) but trades synchronous
correctness for async throughput. That trade is fine — until you
benchmark with finite windows and configured failure injection.

Where this still doesn't generalize

Because rigor matters more than mass distribution, here's what
the data above does and doesn't support:

  • Hardware: dev-laptop, not perf-box. Re-run on perf-box-amd64 is on the roadmap before any comparison_eligible claim is published.
  • Failure rate: configured 3% only. Extreme failure rates (10–30%) and long-tail rare failures not yet validated.
  • Axon variant: Axon Server architecture, not embedded EventStore. The embedded variant may behave differently.
  • Graph backing: Neo4j Bolt driver in path; PG-only variant not yet measured.
  • Window: 180 s. Not long enough to observe long-tail compensation. An 1800 s or longer window would likely shift the numbers for both Axon stacks toward better compensation rates — but that's a separate experiment.

Each of these is a known scope limit, published as next milestones
in the public benchmark suite roadmap.

Reproduction

The whole thing is reproducible. The published artifacts include:

scenarios/e2e-shop-order-saga/
  scenario.json                          ← protocol contract
  comparative-pair-manifest.json         ← what runs comparable
  seed/
    seed-manifest.json                   ← deterministic seed data
    verify-seed.sh                       ← seed verification script

results/raw/e2e-shop-order-saga/
  20260505T115008Z-baseline/             ← Quarkus + Axon run
    result.json
    target-quarkus-app-axon-*.jfr
    logs/axonserver-docker-stats.csv
  20260505T115722Z-baseline/             ← Exeris run
    result.json
    target-exeris-community-app-*.jfr
  20260505T120906Z-baseline/             ← Spring + Axon run
    result.json
    target-spring-app-axon-*.jfr
    logs/axonserver-docker-stats.csv

scripts/
  run-e2e-shop-order-saga-campaign.sh    ← multi-target reproduction (full comparison)
  run-e2e-shop-order-saga-baseline.sh    ← single-target run (called by campaign)

A full reproduction of the three runs above is one invocation of
the campaign script:

./scripts/run-e2e-shop-order-saga-campaign.sh \
  --targets exeris-community-app,quarkus-app-axon,spring-app-axon \
  --graph-track neo4j \
  --repeats 1

Anyone can rerun this on their own hardware and verify the
correctness asymmetry. That's the only thing that matters for
this kind of claim.

What I trust about this, and what I don't

I trust the compensation correctness asymmetry. It reproduces.
It has a mechanical explanation. It shows up in two unrelated
host runtimes for the same Axon model. The claim is structurally
defensible: async-dispatch saga frameworks return before work is
done; that's why the latency looks fast and the compensations
go missing.

I trust the full-system footprint comparison (memory, threads,
CPU). The Axon Server process is real, has a measurable cost,
and that cost belongs in the comparison. It's not a critique of
Axon — Axon Server is doing its job. It's a critique of
benchmarks that draw the boundary tightly around the application
JVM and forget the orchestration backend.

I don't trust the raw latency numbers as comparative evidence.
They're descriptive only. p50 and p95 differences within the
3–10 ms range on dev-laptop, single-tenant loopback, are noise
relative to the 1.5–4× range that would matter for production
deployment decisions. The interesting story isn't "Exeris is
faster than Spring" — because Spring's 14.6 ms is dispatch time,
not work time, so the comparison doesn't even type-check.

The numbers I trust most are the ones I'm most afraid to publish
— because they include what's not yet validated. That's also why
they're labeled descriptive and exploratory, not
comparison_eligible.

The next milestone closes the hardware gap (perf-box-amd64
re-run) and explores wider failure rates. The data above will
either hold up or fall apart. Either is fine — that's why it's
public.


Reproducibility

Run it yourself:
github.com/exeris-systems/exeris-benchmarks
— scenario contract, raw artifacts, and reproduction scripts.
JDK 26, Docker Compose, k6. Single command, single contract.
Disagree with what I measured? Open an issue with your numbers.


The Flow engine producing the compensation correctness above is in exeris-kernel-core and exeris-kernel-spi. The TCK covering compensation under crash-recovery is in exeris-kernel-tck:
🔗 exeris-systems/exeris-kernel
