Benchmark metadata — scenario: e2e-shop-order-saga · hardware: dev-laptop · jdk: 26 · date: 2026-05-05 · status: descriptive · reproducibility: complete
The previous article in this series — Where StructuredTaskScope Ends — argued the architectural case for building a native Flow engine instead of adopting an existing saga framework. This is the empirical sequel: what those frameworks actually cost when you measure all three saga guarantees, not just forward progress.
I ran the comparative benchmark expecting latency differences. Spring + Axon did post the fastest p95 — but that surprised me less than something else in the same table. Under matched 3% failure injection, both Axon-based stacks reported zero compensations. Not "fewer than expected." Zero. That's what made me stop and look harder.
Saga frameworks are evaluated on throughput and latency. That's
incomplete, and sometimes misleading. The saga pattern makes three
guarantees: forward progress, compensation under failure, and
termination.
Most benchmarks measure forward progress only — and then call the
framework "correct" if the happy path works.
Here's what happens when you measure all three.
The setup
Three saga frameworks. One identical scenario contract. Same
payload, same VU count, same think time, same payment_fail_rate
(3% configured failure injection), same JDK 26, same dev-laptop
hardware, same wire protocol (HTTP/1.1, loopback — all three apps
were configured for h2c, but k6 negotiated HTTP/1.1 in every run;
the protocol was identical across stacks regardless).
| Stack | Orchestration model |
|---|---|
| Exeris (open-core) | Native Flow engine — synchronous saga execution, in-process state machine |
| Quarkus 3 + Axon Framework + Neo4j | Async event-sourced saga via Axon Server (separate process) |
| Spring Boot 3 + Axon Framework + Neo4j | Same Axon model, different host runtime |
Difference: orchestration model, not problem domain. Same scenario
contract on every side.
Two terms used throughout this article need explicit definitions.
dev-laptop here is an AMD Ryzen 5 5600 (6 cores / 12 threads),
32 GB DDR4, running Linux with a full graphical
desktop environment, with all benchmark components — k6, the
three target applications, and Axon Server where applicable —
co-located on loopback. perf-box-amd64, the next milestone
for comparative claims, is baremetal with real WAN between the
load generator and the target application. The CPU may or may not
be faster than the dev-laptop — that isn't the point. The point
of the perf-box re-run is removing the localhost shortcut and the
desktop-environment noise, and exposing the saga path to honest
network latency and operational shape.
The benchmark contract — e2e-shop-order-saga — exercises a
five-step order saga: register customer → recommend products → add
to cart → get cart → create order. Payment failure is injected
at 3% rate. Compensation must roll back stock reservation and
release the order.
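The shape of that contract can be sketched as a tiny state machine. This is illustrative Python, not the actual harness; the step and compensation names are taken from the description above, but the logic is my assumption:

```python
import random

STEPS = ["register_customer", "recommend_products", "add_to_cart",
         "get_cart", "create_order"]
PAYMENT_FAIL_RATE = 0.03  # configured failure injection

def run_saga(rng: random.Random) -> dict:
    """Walk the five steps; an injected payment failure at order creation
    triggers the compensation path (roll back stock, release order)."""
    for step in STEPS:
        if step == "create_order" and rng.random() < PAYMENT_FAIL_RATE:
            return {"state": "COMPENSATED",
                    "compensations": ["rollback_stock_reservation",
                                      "release_order"]}
    return {"state": "COMPLETED", "compensations": []}

rng = random.Random(42)
results = [run_saga(rng) for _ in range(10_000)]
comp_rate = sum(r["state"] == "COMPENSATED" for r in results) / len(results)
print(f"compensation rate ≈ {comp_rate:.2%}")  # ~3% by construction
```

The point of the sketch: every saga must end in a terminal state, and a correctly reported compensation rate should track the configured failure rate.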
Reproducibility metadata lives in the published result.json for
every run:
{
  "scenario_id": "e2e-shop-order-saga",
  "contract_id": "exeris_community_h2c_v1",
  "hardware_profile": "dev-laptop",
  "jdk_version": "26",
  "claim_scope": "exploratory",
  "reproducibility_status": "complete"
}
Every published number carries claim_scope, hardware_profile,
transport_mode, and reproducibility_status. Hard fairness
gates reject runs that fail equivalence checks. Comparative claims
require matched scenario contract on both sides — everything else
is labeled descriptive_only or exploratory. This is how every
claim below was evaluated, including my own.
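A hedged sketch of what such a fairness gate could look like, assuming the gated fields are the ones published in result.json; the gate logic itself is my illustration, not the suite's code:

```python
# Illustrative equivalence gate: two runs are comparison-eligible only if
# their scenario contract and environment fields match exactly.
GATED_FIELDS = ("scenario_id", "contract_id", "hardware_profile", "jdk_version")

def claim_scope(run_a: dict, run_b: dict) -> str:
    mismatched = [f for f in GATED_FIELDS if run_a.get(f) != run_b.get(f)]
    return "comparison_eligible" if not mismatched else "descriptive_only"

exeris = {"scenario_id": "e2e-shop-order-saga",
          "contract_id": "exeris_community_h2c_v1",
          "hardware_profile": "dev-laptop", "jdk_version": "26"}
spring = dict(exeris)                                   # same contract
perfbox = dict(exeris, hardware_profile="perf-box-amd64")  # mismatched hardware

print(claim_scope(exeris, spring))   # comparison_eligible
print(claim_scope(exeris, perfbox))  # descriptive_only
```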
→ scenario contract
→ raw artifacts
The numbers that look reasonable until you read them
| Metric | Exeris | Quarkus | Spring |
|---|---|---|---|
| http_req_duration p50 | 4.59 ms | 4.18 ms | 3.10 ms |
| http_req_duration p95 | 16.6 ms | 30.3 ms | 14.6 ms |
| iteration_duration p95 | 9.18 s | 8.49 s | 9.28 s |
| Saga success rate | 96.7% | 98.2% | 98.8% |
| Compensation rate (cfg: 3%) | 3.32% ✓ | 0% ✗ | 0% ✗ |
| Saga unresolved rate | 0% ✓ | 1.82% ✗ | 1.22% ✗ |
| App peak RSS | 459 MB | 752 MB | 1,312 MB |
| App peak threads | 66 | 94 | 81 |
| App CPU time | 24.7 s | 22.3 s | 33.3 s |
| Axon Server RSS (separate proc) | — | ~787 MB | ~850 MB |
| Axon Server PIDs | — | ~120-126 | ~133-148 |
Look at the top of that table. Spring Boot 3 + Axon shows the
fastest p50 and p95. Quarkus and Exeris are in the same band.
By the standard mental model — the one most performance posts
are written under — Spring wins.
Now look at the bottom.
Both Axon-based stacks ran the same configured 3% payment
failure rate as Exeris. Exeris compensated 3.32% of sagas,
matching the configured rate within statistical noise. Both
Quarkus and Spring Boot reported 0% compensations.
Where did the failures go? Into "saga unresolved" — sagas that
never reached a terminal state before the test window closed.
1.82% for Quarkus, 1.22% for Spring.
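A quick sanity check on the table's own arithmetic, assuming the three outcome buckets (success, compensated, unresolved) are mutually exclusive:

```python
# success + compensated + unresolved should cover every saga (~100%,
# modulo rounding in the published table)
runs = {
    "exeris":  {"success": 96.7, "compensated": 3.32, "unresolved": 0.0},
    "quarkus": {"success": 98.2, "compensated": 0.0,  "unresolved": 1.82},
    "spring":  {"success": 98.8, "compensated": 0.0,  "unresolved": 1.22},
}
for stack, r in runs.items():
    total = sum(r.values())
    print(f"{stack}: {total:.2f}%")
    assert abs(total - 100.0) < 0.5  # rounding noise only
```

The buckets balance: what Exeris reports as compensations, the Axon stacks report as unresolved sagas.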
This isn't a performance gap. It's a correctness gap. And the two
Axon-based stacks share it for the same reason.
What you measure depends on where you draw the boundary
Insight
The dropped compensations and the "fast" latency are not two facts.
They are one fact: a framework that returns before work is done
will always show both.
Here's what's structurally happening.
Exeris-native Flow runs the saga inline. The order endpoint
returns when the saga state machine has progressed — when a
compensation has actually executed, the response carries that
fact. The 16.6 ms p95 you see in the table measures actual saga
progress, including compensation work on the failure paths.
Spring + Axon dispatches the order command to Axon Server
asynchronously. The endpoint returns 202 Accepted as soon as the
event is published. The saga state machine continues running in
a separate process, with no one waiting for it on the request
path. The 14.6 ms p95 measures the time to publish to Axon
Server. Not the time to do the work.
Quarkus + Axon has the same architecture. Same async dispatch.
Same illusion.
This is also exactly why both Axon-based stacks drop
compensations. The saga hasn't finished by the time the k6 test
window closes. It's still running in Axon Server, async, with
no one polling. The benchmark observes "0% compensations" because
the compensation work hasn't run yet, not because no failures
occurred. The 1.2–1.8% sagas in non-terminal state are the
fingerprint.
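The boundary effect reduces to a toy model. The numbers below are illustrative, not measured; the point is only that the async stack's reported latency excludes the saga work deferred past the 202:

```python
# Toy model of where the response boundary sits.
DISPATCH_MS = 3    # time to publish the command / enter the state machine
SAGA_WORK_MS = 12  # steps plus, on the failure path, compensation

def sync_request() -> int:
    # response returns only after the saga has actually progressed
    return DISPATCH_MS + SAGA_WORK_MS

def async_request(pending: list) -> int:
    # response returns after publish; the work lands in a queue
    pending.append(SAGA_WORK_MS)
    return DISPATCH_MS

backlog: list = []
print("sync measured: ", sync_request(), "ms")           # includes the work
print("async measured:", async_request(backlog), "ms")   # dispatch only
print("still pending: ", sum(backlog), "ms of saga work invisible to k6")
```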
The same boundary problem applies to every other resource on
that table:
| Resource | Spring (app + Axon Server) | Quarkus (app + Axon Server) | Exeris (all-in) |
|---|---|---|---|
| Memory | 1,312 MB + ~850 MB ≈ 2.16 GB | 752 MB + ~787 MB ≈ 1.54 GB | 459 MB |
| Threads | 81 + ~140 ≈ 221 | 94 + ~123 ≈ 217 | 66 |
| CPU time | 33 s + ~52 s ≈ 85 s | 22 s + ~26 s ≈ 48 s | 25 s |
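Summed explicitly, using midpoints of the observed Axon Server ranges from the figures above:

```python
# Full-system boundary: application process plus orchestration backend.
stacks = {
    "spring":  {"app_rss_mb": 1312, "axon_rss_mb": 850,
                "app_threads": 81, "axon_threads": 140},
    "quarkus": {"app_rss_mb": 752,  "axon_rss_mb": 787,
                "app_threads": 94, "axon_threads": 123},
    "exeris":  {"app_rss_mb": 459,  "axon_rss_mb": 0,
                "app_threads": 66, "axon_threads": 0},
}
for name, s in stacks.items():
    rss = s["app_rss_mb"] + s["axon_rss_mb"]
    threads = s["app_threads"] + s["axon_threads"]
    print(f"{name:>8}: {rss:>5} MB RSS, {threads:>3} threads")
```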
Each of those metrics tells the same story. Drawing the boundary
tightly around the application process makes everything look
reasonable. Drawing it around "what does it take to run a saga
end-to-end?" — every metric is dominated by what's outside the
application process.
This is the real architectural cost of async dispatch: not just
the dropped compensations, but the second process you have to
feed CPU, RAM, threads, network, and operational attention.
Exeris-native Flow chose the in-process path. That choice is
also why it correctly compensates.
Two host runtimes, same defect
Two unrelated host runtimes (Quarkus 3 and Spring Boot 3) share
the same correctness defect. That rules out host-runtime quirks.
The shared cause is Axon's event-sourced async dispatch model
colliding with a benchmark window of finite length.
Axon's SubscribingEventProcessor (the variant used here) processes
events on a thread distinct from the dispatcher. Saga state
transitions land on that processor thread. Status reads — what k6
polls to determine "did this saga complete?" — go through Axon's
in-memory ConcurrentHashMap projection, which is updated only
after the event handler runs.
When the benchmark window ends, in-flight events that haven't
been processed are still in the queue. The saga's state never
reaches COMPLETED or COMPENSATED from the test's vantage point.
k6 sees a non-terminal state and counts it as saga_unresolved.
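A deterministic toy model of that finite-window effect — this is not Axon code, and the budget number is arbitrary; it only shows how a lagging projection plus a closed window produces non-terminal reads:

```python
from collections import deque

# Events queue for a separate processor (modeled as a budgeted drain
# loop); the status projection a poller reads only becomes terminal
# after the handler has actually run.
events = deque(f"saga-{i}" for i in range(100))   # 100 failed sagas in flight
projection = {s: "RUNNING" for s in events}       # stand-in for the in-memory map

HANDLER_BUDGET = 97  # events the processor drains before the window closes
for _ in range(HANDLER_BUDGET):
    projection[events.popleft()] = "COMPENSATED"  # terminal, but late

unresolved = [s for s, state in projection.items() if state == "RUNNING"]
print(f"saga_unresolved: {len(unresolved)} of 100")
```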
I want to be clear about what this is and isn't. It's not a bug
in Axon — it's the design. Async event-sourced sagas are designed
to eventually reach terminal state, given enough time. The
benchmark window — 180 seconds — is finite. The saga's
"eventually" doesn't fit in that window for some percentage of
failed sagas.
Exeris-native Flow has no such gap because the saga state machine
runs synchronously on the request VT, with off-heap state
persisted before the HTTP response. When the response says "saga
complete", the saga is complete. When the response says "saga
compensated", the compensation has executed.
You can verify this by inspecting the JFR snapshots that
accompany every benchmark run. The Exeris JFR shows compensation
events on the request VT timeline, completing within the request.
The Axon JFRs show event handler invocations that don't always
complete before the request VT exits.
Where structured concurrency helps — and where it stops
A saga isn't a fork-join problem. STS (StructuredTaskScope)
gives you bounded fork-and-join, structured cancellation,
explicit error propagation. Useful primitives, but not enough on
their own:
- STS does not give you state persistence between handler invocations
- STS does not give you compensation queue durability across crash
- STS does not give you idempotency for retry-after-restart
Exeris-native Flow combines STS for in-flight orchestration with
an off-heap flow state machine and an outbox for crash-safe
compensation. STS is half the saga story, not the whole one.
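For concreteness, here is a generic sketch of the outbox idea — the durability piece STS alone lacks. This is the general pattern, not Exeris's actual implementation:

```python
# Persist the compensation intent *before* executing it, so a crash
# between failure and compensation is recoverable by replaying the log.
outbox: list = []   # stand-in for a durable log (file, table, off-heap store)

def fail_payment(order_id: str) -> None:
    # 1. durably record what must be undone
    outbox.append({"order": order_id, "action": "rollback_stock", "done": False})
    outbox.append({"order": order_id, "action": "release_order", "done": False})
    # 2. execute; a crash here leaves undone entries behind for recovery
    drain_outbox()

def drain_outbox() -> None:
    for entry in outbox:
        if not entry["done"]:
            # execute_compensation(entry)  <- side effect, must be idempotent
            entry["done"] = True

fail_payment("order-42")
# After a crash, recovery is just drain_outbox() on the reloaded log.
```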
The Axon model is the inverse: it solves persistence and
durability (Axon Server holds events) but trades synchronous
correctness for async throughput. That trade is fine — until you
benchmark with finite windows and configured failure injection.
Where this still doesn't generalize
Because rigor matters more than mass distribution, here's what
the data above does and doesn't support:
- Hardware: dev-laptop, not perf-box. A re-run on perf-box-amd64 is on the roadmap before any comparison_eligible claim is published.
- Failure rate: configured 3% only. Extreme failure rates (10–30%) and long-tail rare failures are not yet validated.
- Axon variant: Axon Server architecture, not embedded EventStore. The embedded variant may behave differently.
- Graph backing: Neo4j Bolt driver in path; a PG-only variant is not yet measured.
- Window: 180 s. Not long enough to observe long-tail compensation. Windows of 1800 s and longer would likely shift the numbers for both Axon stacks toward better compensation rates, but that's a separate experiment.
Each of these is a known scope limit, published as next milestones
in the public benchmark suite roadmap.
Reproduction
The whole thing is reproducible. The published artifacts include:
scenarios/e2e-shop-order-saga/
scenario.json ← protocol contract
comparative-pair-manifest.json ← what runs comparable
seed/
seed-manifest.json ← deterministic seed data
verify-seed.sh ← seed verification script
results/raw/e2e-shop-order-saga/
20260505T115008Z-baseline/ ← Quarkus + Axon run
result.json
target-quarkus-app-axon-*.jfr
logs/axonserver-docker-stats.csv
20260505T115722Z-baseline/ ← Exeris run
result.json
target-exeris-community-app-*.jfr
20260505T120906Z-baseline/ ← Spring + Axon run
result.json
target-spring-app-axon-*.jfr
logs/axonserver-docker-stats.csv
scripts/
run-e2e-shop-order-saga-campaign.sh ← multi-target reproduction (full comparison)
run-e2e-shop-order-saga-baseline.sh ← single-target run (called by campaign)
A full reproduction of the three runs above is one invocation of
the campaign script:
./scripts/run-e2e-shop-order-saga-campaign.sh \
--targets exeris-community-app,quarkus-app-axon,spring-app-axon \
--graph-track neo4j \
--repeats 1
Anyone can rerun this on their own hardware and verify the
correctness asymmetry. That's the only thing that matters for
this kind of claim.
What I trust about this, and what I don't
I trust the compensation correctness asymmetry. It reproduces.
It has a mechanical explanation. It shows up in two unrelated
host runtimes for the same Axon model. The claim is structurally
defensible: async-dispatch saga frameworks return before work is
done; that's why the latency looks fast and the compensations
go missing.
I trust the full-system footprint comparison (memory, threads,
CPU). The Axon Server process is real, has a measurable cost,
and that cost belongs in the comparison. It's not a critique of
Axon — Axon Server is doing its job. It's a critique of
benchmarks that draw the boundary tightly around the application
JVM and forget the orchestration backend.
I don't trust the raw latency numbers as comparative evidence.
They're descriptive only. p50 and p95 differences within the
3–10 ms range on dev-laptop, single-tenant loopback, are noise
relative to the 1.5–4× range that would matter for production
deployment decisions. The interesting story isn't "Exeris is
faster than Spring" — because Spring's 14.6 ms is dispatch time,
not work time, so the comparison doesn't even type-check.
The numbers I trust most are the ones I'm most afraid to publish
— because they include what's not yet validated. That's also why
they're labeled descriptive and exploratory, not
comparison_eligible.
The next milestone closes the hardware gap (perf-box-amd64
re-run) and explores wider failure rates. The data above will
either hold up or fall apart. Either is fine — that's why it's
public.
Reproducibility
Run it yourself:
github.com/exeris-systems/exeris-benchmarks
— scenario contract, raw artifacts, and reproduction scripts.
JDK 26, Docker Compose, k6. Single command, single contract.
Disagree with what I measured? Open an issue with your numbers.
The Flow engine producing the compensation correctness above is in exeris-kernel-core and exeris-kernel-spi. The TCK covering compensation under crash-recovery is in exeris-kernel-tck:
🔗 exeris-systems/exeris-kernel