DEV Community: Arkadiusz Przychocki

What I Knew, and What I Built

Arkadiusz Przychocki — Wed, 17 Jun 2026 10:10:02 +0000

This is the eighth article in the Exeris Kernel series. The first
seven covered decisions I'm confident about:
context propagation,
where runtime polymorphism stops paying for itself,
where StructuredTaskScope earns its keep,
off-heap TLS,
the Flow engine,
compensation correctness,
and the LoanedBuffer pattern.
This one is different. It's about two contracts I had to retrofit
after the initial design closed, and the one thing they have in
common that I didn't see until both had already happened.

I had one architectural vision for Exeris Kernel and two
oversights in executing it. They look unrelated from the outside —
one is in the HTTP client, the other is in the security
subsystem — but they have the same shape. In both cases I built
a single concrete instance of something the architecture needed
as a pluggable seam. And in both cases the gap stayed invisible
until I read the consuming code end to end.

That last part is the whole point of this article, so I'll state
it up front: I did not catch either of these in a design review.
I caught them by being the consumer of my own kernel. The reason
that matters — and the reason this isn't just a confession — is
that most people who design a contract never occupy the position
that reveals its gaps. They ship the contract, someone else
consumes it two layers downstream, and the gap surfaces as a
production incident. I caught mine early and on my own only
because I don't only work on the kernel. I also build the tooling
that generates code against it, and the product that runs on top
of that. I sit on both sides of the boundary I designed. And
"early" is literal: I found the codec gap in local testing,
reading both the generator in exeris-tooling and the code it
emits. The tooling that emits this code isn't release-tagged yet —
generated output is regenerated on demand, with nothing published
for a downstream app to pin. So the version of the generator that
hard-wired the concrete codec never had a consumer depending on it;
by the time the rewrite that consumes the SPI landed, the contract
it consumes was already in a released kernel. That's not luck. I was
standing at the consumer edge while the contract still had zero
dependents — the cheapest possible moment to find a gap like this.

There's a third, smaller version of the same mistake nested
inside the fix for the first one. I'll get to it — it's the part
I find most useful, because it shows the blind spot isn't a
one-time lapse you can resolve by being more careful. It recurs,
even while you're actively fixing an instance of it.

The HttpClient I Built and Couldn't Use

A note on version numbers first, because there are two tracks and
conflating them is easy. The numbered versions in this article are
kernel releases — 0.5.0 and 0.8.0 below, 0.8.1 current, all tagged
on GitHub. The tooling that consumes these seams is a separate
track: implemented through 0.4, sitting at 0.5.0-SNAPSHOT,
deliberately not release-tagged yet. So when I say a fix "shipped,"
I mean it's in the released kernel. When I say a generator rewrite
is "sequenced," I mean the tooling track, which hasn't tagged a
release on purpose — more on why at the end.

The break point was concrete. I was building the SDK/tooling
layer — the code generator in exeris-tooling that emits typed
per-entity REST clients (WidgetClient.findById(id),
WidgetClient.create(widget)) against whatever kernel artifact
the application's POM selects. The generator's whole job is to be
tier-neutral: it emits code that compiles against the Community
tier (HTTP/1.1 + HTTP/2) or the Enterprise tier (HTTP/3 + Panama
JSON, when it ships) without knowing or caring which.

It couldn't. The HTTP client facade I'd shipped — then called
CommunityWebClient — hard-wired Jackson 3 as the JSON codec.
Body marshalling lived inside the facade: mapper.writeValueAsBytes
on the way out, mapper.readValue on the way in. The constructor
took an ObjectMapper directly. Two consequences fell out of that,
and I'd internalized neither until the generator forced them into
view.

The first was that the codec was not swappable. If a consumer
wanted Protobuf, CBOR, or a future Panama-native JSON binding,
there was no seam to plug into — the format was welded into the
facade. That was a deliberate decision when I made it. Jackson
was already the codec the rest of the kernel linked; exposing
ObjectMapper on the constructor was the path of least resistance
when I built the facade (ADR-026), and I told myself the kernel had
no concrete need for a second codec yet. Locally rational. The problem is that "no
concrete need yet" was a statement about the kernel in isolation,
and the consumer that needed the seam — the tier-neutral generator —
didn't exist yet when I made the call.

The second consequence was worse, because it wasn't even a
decision I remembered making. The facade was named
CommunityWebClient, and the generator emitted that name into
every client file it produced:

public final class WidgetClient {
    private final CommunityWebClient client;          // tier identity, in user code
    public WidgetClient(CommunityWebClient client) { ... }
    public Optional<Widget> findById(UUID id) {
        try { return Optional.ofNullable(client.get(...)); }
        catch (CommunityWebClient.WebClientException ex) { ... }  // tier identity again
    }
}

The tier name leaked into every application class the tooling
generated. This is the build-time analogue of a rule I'd already
written down and enforced everywhere else in the kernel — The
Wall (ADR-006): the kernel boundary visible to applications must
be implementation-blind. I had enforced it rigorously at runtime.
I had not noticed it applied at build time too, to the symbols
that show up in generated source.

The pattern I'd already built — once

Here's the part that made this sting. The server side already had
the SPI I was missing on the client side. HttpResponseBodyEncoder
had been a tier-neutral SPI seam since 0.5.0 (the ADR-009 era):
an encoder contract, an encoding-context record, a registry with
an empty() factory, a reusable off-heap HttpEncodedBody
carrier, and a Community-side Jackson driver implementing it. I
had designed exactly the right shape — encoder + context +
registry, with the concrete codec as a swappable driver — and
then built it in precisely one place and not noticed that the same
shape was needed in three others.

The body-codec design space is a 2×2 matrix: {request, response} × {encode, decode}. One cell had a proper SPI from the start —
server-side response-encode, shipped at 0.5.0 (ADR-009). The other
three were hard-wired to Jackson, welded in at different times as
each surface got built: client-side request-encode and
response-decode went in when I later built the client facade
(CommunityWebClient, ADR-026), and server-side request-decode
lived in the generated request handlers. I wasn't missing the
concept of a codec SPI. I'd authored it — before the facade that
needed it even existed — and then didn't reuse it when I built that
facade, because at the time the one cell I'd filled was the only
one with a consumer staring at it.

	encode	decode
request	client-side — ADR-034 (new)	server-side — ADR-036 (overlooked)
response	server-side — ADR-009 (0.5.0)	client-side — ADR-034 (new)

The fix (ADR-034)

ADR-034 closed the client-side half. Six new SPI types in
eu.exeris.kernel.spi.http, deliberately mirroring the existing
server-side encoder triplet for grep symmetry:
HttpRequestBodyEncoder, HttpRequestEncodingContext,
HttpRequestBodyEncoderRegistry on the encode side;
HttpResponseBodyDecoder, HttpResponseDecodingContext,
HttpResponseBodyDecoderRegistry on the decode side. The Jackson
binding became two Community drivers (CommunityJsonRequestBodyEncoder,
CommunityJsonResponseBodyDecoder) behind those contracts. The
HttpEncodedBody carrier from 0.5.0 was reused rather than
duplicated.

The cost is real and worth naming: every body now crosses a
registry resolution and a driver indirection that in the original
facade was a direct mapper.writeValueAsBytes call. For the single-codec case
that's overhead I'm paying to keep the seam open — a deliberate
trade, not a free win.

The facade itself moved out of the Community tier into Core and
lost the tier name: CommunityWebClient became KernelWebClient
in eu.exeris.kernel.core.http.client. Its constructor lost the
ObjectMapper parameter — Jackson descended to the codec driver,
and Core stopped having any opinion about JSON at all. The
generator now emits KernelWebClient into application code; no
tier identity surfaces. ADR-034 superseded ADR-026, which had
been the facade's previous home and had treated the tier-name
leak as a string-substitution problem to fix in lockstep rather
than the structural symptom it actually was.

I want to be precise about what was a decision and what was an
oversight here, because they're not the same and conflating them
would be too kind to myself. Hard-wiring Jackson into the client
facade was a
decision whose limits I underestimated — defensible when made,
wrong once the consumer arrived. The tier-name leak was not a
decision at all. It was something I'd have caught instantly if I'd
read the generator's output as application code instead of as my
own code. The difference between those two is the difference
between miscalibrated judgment and not looking — and only one of
them is fixable by thinking harder at design time.

The oversight inside the fix (ADR-036)

This is the part I find most useful, and it's why I'm writing the
article at all.

ADR-034 completed three of the four matrix quadrants:
server-response-encode (already there since 0.5.0),
client-request-encode (new), client-response-decode (new). It left
the fourth — server-side request-body decode, the wire body that
becomes a typed handler argument like Widget for POST /widgets.
That quadrant had no SPI seam at all. It was hard-wired Jackson
inside generated code: the tooling's KernelHandlerGenerator
emitted a static Jackson MAPPER field and a JacksonException
import into every controller it produced. The exact same
build-time Wall breach I'd just fixed on the client side, sitting
untouched on the server side, in a different generator.

It surfaced the exact same way the original problem did — by
reading the consuming code end to end. The first set of gaps came
out when I re-read KernelClientGenerator. This one came out when
I re-read KernelHandlerGenerator. Same act, different file, same
class of finding.

I want to separate this cleanly from a decision I did make
deliberately in ADR-034, because the honest version of this story
depends on the distinction. ADR-034 explicitly deferred one thing:
unifying the server-side encode path — refactoring the working
0.5.0 JsonBodyEncoder so it subscribes a content-type the way
the new client decoder does. That was a choice, recorded as a
v1.0 cleanup, because it's a refactor of a seam that already
works. The server-side request-decode quadrant was a different
thing entirely: not a seam I chose to leave alone, but a seam that
never existed, hidden inside generated code I wasn't looking at.
One was deferred. The other was overlooked. ADR-036 closed the
overlooked one, mirroring the response-decoder triplet verbatim
into HttpRequestBodyDecoder / HttpRequestDecodingContext /
HttpRequestBodyDecoderRegistry. Both ADRs shipped in kernel
0.8.0 — the four-quadrant codec SPI is in the released kernel
today. The tooling generator that consumes the new server-decode
seam has since been rewritten to resolve through the registry
instead of baking in a concrete codec symbol; where that leaves the
tooling release, I come back to at the end.

This is the part that matters: I was actively working on exactly
this problem — codec seams, build-time Wall hygiene, the matrix —
and still left one cell hard-wired, because the cell lived in a
consumer I hadn't re-read yet. Not carelessness. I was looking at
three of the four places the pattern lived; the fourth was
somewhere I hadn't pointed my attention. The closing section is
about why that's structural rather than a discipline failure.

The Validator I Built and the Seam I Didn't

The second oversight is in the security subsystem, and it's the
one that taught me the most — because the first time I described
it to myself, I got it wrong. My initial framing was "I forgot to
build identity provider support." That's not what happened, and
the real version is more useful.

I did not forget about identity. The kernel has had token
validation since 0.5.0. SecurityProvider.authenticate(LoanedBuffer rawToken) returns an AuthenticationResult carrying a
PrincipalContext (principal id, tenant id, roles, scopes) and a
StorageContext (the multi-tenant routing decision). That seam is
consumed at the HTTP edge today. Behind it sits
CommunityJwksValidator — a Nimbus JOSE pipeline that does the
real work: kid → key → signature → issuer → audience → expiry,
fail-closed at every step per ADR-012. Identity validation isn't
missing. It works, and it's been working.

What I built was a single, static, RSA-only validator fused into
one SecurityProvider implementation. The key set is an immutable
Map<kid, RSAPublicKey> injected at construction — no rotation,
no JWKS fetch, no EC, and crucially, no way to host a second
identity provider without subclassing or duplicating the
cross-cutting logic that sits around validation. The validation
pipeline is already about eighty percent of an OIDC resource-server
validator. I just welded that eighty percent into one concrete
class instead of exposing it as a contract.

I didn't notice this was the same mistake as the HttpClient codec
until I'd already written the validator down the same way. Different
subsystem, identical shape: there I built one codec quadrant and
left the rest concrete; here I built one validator and left no seam
for a second. In both cases the thing I
shipped works perfectly for the single case it handles. In both
cases the architecture needed a pluggable contract, and I gave it
a concrete instance.

Where it surfaced

The HttpClient gap surfaced when I read the tooling generators.
This one surfaced when I tried to put a real product on top of the
kernel.

A B2B SaaS deployment doesn't authenticate against one identity
provider. It fronts employees through one (say Okta), B2C users
through another (Auth0), and service-to-service traffic through a
third (something internal). Dispatch happens by inspecting the
token's issuer before validation. My single-SecurityProvider
architecture had no room for that. The only way to bolt on a
second provider was an application-level composite that would have
to re-implement the fail-closed, deterministic-deny invariants
that ADR-012 deliberately keeps inside the kernel. In other words:
to use the kernel for the thing I built the kernel for, an
application would have to reach around the kernel and re-implement
its most safety-critical contract. That's the gap. It's not
missing validation. It's a missing seam, and the seam is exactly
the part a real consumer needs first.

The consumer that revealed it was BudgetHQ — the product I'm
building on the kernel. BudgetHQ is B2C and B2B at once: individual
users on one side, business workspaces on the other. The B2C path
authenticates fine against a single provider, which is exactly why
the gap stayed hidden — the single-provider validator handles the
common case without complaint. It surfaced the moment I started
planning multi-provider authentication for business workspaces: a
workspace that federates its own employees through its own identity
provider while B2C users keep authenticating through the default
one. The path from Authorization: Bearer … to a populated
PrincipalContext was, for that multi-provider case, simply absent
as a kernel-supported path. The roadmap names it without hedging:
the single largest "ship a B2B SaaS" blocker for the ecosystem. I
only know that because I'm also the one trying to ship the business
workspace.

The decision (RFC, not yet code)

I'll be honest about the status of this one, because it differs
from the HttpClient case and the difference matters. The HttpClient
oversight is fixed and released — ADR-034 and ADR-036 shipped in
kernel 0.8.0, and the seam is in the kernel running today. The
tooling generator that consumes it has since been rewritten to
match; the only piece still in flight is the tooling release that
carries that rewrite, on a track that isn't tagged yet. The IDP
oversight is at a different stage entirely: diagnosed and decided,
but not yet built. The decision went through an RFC rather than
straight to an ADR, because unlike the codec fix it carried
genuinely open strategic choices — and I want to show that
distinction rather than paper over it.

The decision: introduce a dedicated IdentityProvider SPI that
SecurityProvider delegates to. SecurityProvider stays the entry
point but becomes a thin dispatcher — it selects one
IdentityProvider by a cheap issuer/format peek, delegates
validation, then applies the ADR-012 cross-cutting concerns
(isolation-claim → StorageContext, fail-closed semantics)
uniformly to whatever provider was dispatched. Validation becomes
per-provider; orchestration stays central. The first reference
driver is OIDC + JWKS, because the existing CommunityJwksValidator
is already most of the way there — extraction is mostly refactor,
which is the lowest-risk way to ship the first driver. The registry
reads the same way as the codec registries from ADR-034
(of(List<…>), priority-ordered) — deliberately, so that an
identity provider that didn't support pluggable variants would be
the odd one out, not the norm.

That dispatch isn't free either: every request now pays a
pre-validation peek to select a provider before any crypto runs.
Cheap next to signature verification, but it's a second step on
the auth edge the single-validator design didn't have.

There's a sharp failure mode hiding in that design — the kind of
thing that turns a security seam into a security hole. If the dispatcher retried the next provider
after a selected provider's validation failed, a token matching
issuer A but failing A's signature could be "rescued" by provider
B — a federation fail-open. The contract closes it normatively:
selection precedes validation and is separate from it; once a
provider is selected, its failure is terminal, with no fall-through.
That invariant is a mandatory TCK assertion, not a code comment.
The seam I didn't build the first time turns out to have a
correctness property I wouldn't have specified at all if I hadn't
been forced to design the pluggable version.

That's what catching the gap at the seam level instead of in
production buys you: the production version of this lesson is
a fail-open incident, discovered by someone who isn't you, in a
system you can no longer cheaply change. The design-time version is
a TCK assertion in an RFC. Same lesson, vastly different cost.

Identity propagation outbound — how a parsed identity travels to
the next service in a call chain — is a related question with its
own seam (ADR-032's HttpClientRequestEnricher, which propagates
parsed X-Tenant-Id / X-Principal-Id headers rather than
forwarding raw bearer tokens). That part already exists and is
deliberately kept narrow: the kernel holds the parsed identity, not
the raw credential, so it can't leak a token it never retained.
The inbound seam — the IdentityProvider SPI — is the one the RFC
decides and v0.10 implements.

Two Oversights, One Blind Spot

From the outside the two look like different categories of mistake.
One is a codec hard-wired into a facade; the other is a validator
welded into a single provider. Different subsystems, different
consequences, different fixes. But mechanically they're the same
error, and they failed the same way.

The error is the same: I built a concrete instance where the
architecture needed a pluggable seam — Jackson welded into the
client, OIDC welded into one SecurityProvider. Both worked, for
exactly the one case that had a consumer staring at it when I wrote
the code; the shape I'd have needed for the second case was already
authored elsewhere (the server-side encoder SPI) or eighty percent
present (the JWKS validator), so this was never ignorance of the
right design. The failure mode is the same too: both gaps were
invisible from inside the kernel and obvious from the consumer's
side. The defect was never in the code I was reading — it was in
the relationship between that code and a consumer a layer or two
out, and you cannot see a relationship by staring harder at one end
of it.

The thing that actually caught both gaps was occupying the
consumer's position myself. I found the codec gap because I write
the tooling that generates code against the kernel. I found the
validator gap because I'm building a product that runs on the
kernel. In both cases I crossed from the author's side of the
boundary to the consumer's side, read the contract from there, and
the gap was immediately obvious. Not because I got smarter on the
walk across — because the gap is only visible from that side.

That generalizes past me, and not in a flattering direction. On
most teams the person who designs a contract is structurally
prevented from occupying its consumer position. They ship the
contract; a different team consumes it two layers downstream; the
gap surfaces months later as a production incident that nobody can
cheaply fix because the contract has hardened and acquired
dependents. The firefight is not a sign anyone was careless. It's
the default outcome of a division of labour where the designer
never stands where the gap is visible. I caught mine at the design
seam instead of in production for one structural reason, and it has
nothing to do with discipline: I happen to sit on both sides of the
boundary I designed.

I don't think the takeaway is "always be your own consumer" —
that's often not possible, and stated that broadly it's the kind of
advice that's true and useless. The narrow, actionable version is
this: a contract's gaps live at its consumer edge, so someone has
to read the contract from the consumer's side before it hardens.
If that can't be the designer, it has to be a real consumer brought
in early — not a reviewer evaluating the contract on its own terms,
but someone forced to actually build against it. The cheapest
moment to find these gaps is before the contract has dependents.
Once it has them, the same fix costs a coordinated SPI change plus
a migration of every consumer that hardened against the old shape —
which is the state most teams discover the gap in.

The recursion in the HttpClient section is the proof that this
isn't a willpower problem. I left the fourth codec quadrant hard-wired
while actively fixing the other three. I was as alert to this
exact failure mode as I will ever be, and it still got past me,
because the fourth quadrant lived in a consumer I hadn't re-read
yet. You don't beat that by trying harder. You beat it by
arranging to read every consumer end to end before the contract
ships — or by accepting that the ones you don't read are where the
next gap is.

What This Doesn't Solve

The heuristic — read the consumer's code before the contract
hardens — has a boundary I want to be honest about.

It assumes a consumer exists to read. When you're designing a
contract genuinely ahead of any consumer, there's nothing to read
from the other side yet, and "be your own consumer" degenerates
into guessing. The method also isn't free: pulling a real consumer
in early, before the contract stabilizes, slows the design down and
couples two moving targets. I absorbed that cost cheaply because the
consumer and the contract are both mine. A team paying it
deliberately is making a real trade, not collecting a free win.

And it doesn't close the identity story. The IdentityProvider
seam is inbound only — token to PrincipalContext. Outbound
service-to-service identity past parsed-identity headers (token
exchange, on-behalf-of, client-credentials) is reserved in the RFC
as a future OutboundCredentialProvider seam, not built. A
zero-trust mesh that needs the kernel to mint a downstream
credential is exactly the case this design names and then defers.

What's Next

Concretely, on the kernel:

The HttpClient codec SPI is shipped and released. ADR-034
(client-side request encode + response decode, KernelWebClient
facade) and ADR-036 (server-side request decode, completing the
2×2 matrix) both landed in kernel 0.8.0; the four-quadrant seam is
in the kernel running today, 0.8.1. The fix wasn't housekeeping —
it was a precondition. The tier-neutral generator couldn't be
written while the codec was welded into the facade, so the seams
the kernel shipped in 0.8.0 are exactly what unblocked the generator
that needed them. Fixing the contract in the released kernel cleared
the next layer up. I didn't fix the codec in a vacuum — I fixed it
because the thing I was building next required it.

And the next layer followed. The generators that consume those
seams — KernelHandlerGenerator, KernelClientGenerator — have
since been rewritten to use them: the handler resolves the request
body through the decoder registry instead of an inlined Jackson
call, and the client emits the tier-neutral KernelWebClient,
which also closes the original tier-name leak this article opened
with. That rewrite is merged on the tooling's main line. What's
still pending is narrower than the rewrite — it's the tooling
release itself: the tooling sits at 0.5.0-SNAPSHOT with no tag
cut, a deliberate hold whose next tag waits on the Capabilities/SKU
milestone, the same way the kernel tagged 0.8.0 only once its seams
were ready. The loop is closed in source — kernel seam released,
consuming generators rewritten — and the only thing still sequenced
is the release that ships it.

The IdentityProvider SPI is decided and reserved — the direction
is locked by RFC (dedicated SPI, SecurityProvider as dispatcher,
OIDC-first reference driver, fail-closed terminal selection). The
ADR that locks the detail and the implementation both land in
v0.10. A load-bearing dependency lands first: JWKS key rotation
with an overlap window, in v0.9, which the OIDC driver then
consumes.

Neither fix is the interesting part. The interesting part is the
question I now run against every contract before I call its design
closed: who consumes this, and have I read their code — not my
code — end to end? It's a weak question. It doesn't sound like
architecture. But it catches the specific class of gap that strong
architectural thinking is structurally blind to, and after two
instances of the same blind spot I trust the weak question more
than I trust my own confidence that a design is complete.

The codec SPI discussed here lives in exeris-kernel-spi under eu.exeris.kernel.spi.http; the KernelWebClient facade is in exeris-kernel-core. The security seam (SecurityProvider, PrincipalContext, StorageContext) is in exeris-kernel-spi under eu.exeris.kernel.spi.security. The decision records — ADR-034, ADR-036, ADR-032, and the IdentityProvider RFC — are in the kernel docs tree:
🔗 exeris-systems/exeris-kernel

ByteBuffer Solves Half the Problem: The LoanedBuffer Pattern

Arkadiusz Przychocki — Thu, 28 May 2026 08:15:56 +0000

This is the seventh article in the Exeris Kernel series.

TL;DR: Direct ByteBuffer gives you zero-copy but defers cleanup to the GC.
Arena gives you deterministic cleanup but scopes ownership to a single region.
Neither models a buffer shared across threads with a lifetime longer than any
single fork. LoanedBuffer is the third option Exeris needed: explicit
reference counting, try-with-resources for the boring case, and EX-MEM-1003
when discipline fails. The cost is honest - the compiler will not catch a
missed retain() before a fork().

You can move TLS off the heap. You can ban ThreadLocal
and replace it with ScopedValue. You can structure your concurrency with
StructuredTaskScope. You can
push the TLS boundary into Panama FFM
so that cipher operations no longer allocate. And you will still find allocation
pressure on the hot path - because somewhere, something is moving bytes through a
byte[].

This article is about that boundary.

It is also where one specific JVM-era assumption finally breaks: that direct
ByteBuffer is "the off-heap one." Direct ByteBuffer solves zero-copy. It does
not solve ownership. In a runtime where memory lifecycle is supposed to be part of
the architecture, those are not the same problem.

The Constraint

When I started designing the memory subsystem for Exeris, I had two constraints
that had to hold simultaneously on the request hot path:

Zero copy. Bytes coming off a socket, through TLS, through HTTP framing, into a request handler - none of those steps may allocate a new array and copy.
Deterministic cleanup. When work on a buffer is done, native memory must be released now, not whenever the GC notices.

Most JVM memory abstractions give you exactly one of these.

byte[] gives you neither. Heap allocation, GC-managed lifecycle, and a copy every
time you cross a native boundary.

ByteBuffer.wrap(byte[]) is a heap buffer with a different name. Same problems.

ByteBuffer.allocateDirect(n) gives you (1) but not (2). The segment is off-heap, so
crossing into native code does not require a copy. But the underlying memory is freed
when a Cleaner thread observes that the ByteBuffer reference has become unreachable.
Under load, this means buffers can survive arbitrarily long after the work is done.
You do not control when. You cannot ask. There is no close().

Arena.allocate(layout) from Panama FFM gives you (2) but with a coarser ownership
model. An Arena owns a region of memory; closing the arena releases everything in
it. That is fine for a request lifetime. It is less fine when a buffer is shared
between threads, or transferred from one task to another and released by the second,
or part of an in-flight queue.

I needed both. And I needed them composable.

ByteBuffer Solves Half the Problem

What stopped me from using ByteBuffer directly was not API ergonomics. It is
that ByteBuffer does not have an ownership model.

It has an access model - position, limit, capacity, slice, duplicate - but
nothing that answers the question "who is responsible for releasing this memory,
and when?". Direct buffers defer that question to the GC. Heap-backed buffers
defer it to the GC twice - once for the buffer object, once for the array it wraps.

That works fine when allocations are infrequent and lifetimes are obvious. It
breaks when bytes are flowing through 1-VT-per-stream concurrency at network
speed and a Cleaner thread is the only thing standing between you and a slow,
silent native heap leak.

The standard JVM workaround is buffer pooling: keep a ConcurrentLinkedQueue of
direct ByteBuffer instances, hand them out, and trust callers to return them.
This works in practice and underpins frameworks like Netty. It also reintroduces
the exact problem the GC was trying to solve: explicit lifecycle management, with
the additional twist that forgetting to return is now silent - the buffer just
sinks into orphan memory without a Cleaner event to notice.

What I wanted was the lifecycle discipline of an Arena combined with the
flexibility of a pooled buffer - and a way to share ownership across threads
without either a lock or a leak.

That is what LoanedBuffer is. The rest of this article is what it cost.

The Loan Pattern

The basic shape is unsurprising. LoanedBuffer implements AutoCloseable. You
allocate via the SPI. You use try-with-resources:

try (LoanedBuffer buffer = allocator.allocate(AllocationHint.MEDIUM)) {
    buffer.writeBytes(payload, 0, payload.length);
    transport.send(buffer);
}

Three things are happening here that ByteBuffer does not give you.

First, the allocator is injected via SPI. The application code does not know
whether the underlying allocator is a slab pool, a partitioned arena, or a
specialized native pool optimized for a specific transport. Implementation
blindness is preserved - Core operates exclusively on the MemoryAllocator
contract, resolved at bootstrap via ServiceLoader:

MemoryAllocator allocator = ServiceLoader.load(MemoryAllocator.class)
        .findFirst()
        .orElseThrow(() -> new KernelBootstrapException(KernelErrorCodes.EX_BOOT_0002));

Second, AllocationHint is a typed enum, not a raw byte count. The hint tells
the allocator which size class is wanted (SMALL, MEDIUM, LARGE,
NETWORK_FRAME). The allocator picks a slab from the matching pool. There is
no math at the call site, no rounding, no "did I just trigger a slow path?".

Third, close() is deterministic and immediate. When the try block exits,
the slab returns to its pool now. The watermark manager updates now. There
is no Cleaner thread, no PhantomReference, no waiting for GC.

That is the boring, single-owner case. The pattern earns its name in the next
case - the one ByteBuffer cannot model at all.

Reference Counting with VarHandle

Inside the kernel, a buffer often has more than one logical owner. The transport
layer wants to hold it while the request handler is reading. The handler wants to
hold it while async work is in flight. The async work might want to retain it for
a follow-up operation.

LoanedBuffer solves this with explicit reference counting. Allocation starts
the count at one. retain() increments. close() decrements. When the count
reaches zero, the slab returns to the pool.

The implementation uses VarHandle for the CAS path. No synchronized, no
AtomicInteger allocation per buffer, no monitor inflation. The reference count
is a primitive int field on the buffer itself, accessed through a
class-level VarHandle:

public final void retain() {
    REF_COUNT_HANDLE.getAndAdd(this, 1);
}

public final void close() {
    int previous = (int) REF_COUNT_HANDLE.getAndAdd(this, -1);
    if (previous == 1) {
        fireCloseActions();
    }
}

Calling close() more than the buffer was retained is a fatal contract violation.
Calling retain() on a non-owning view - for example, a peek() slice that
exposes a memory region without transferring ownership - is also a contract
violation. The kernel emits EX-MEM-1003 (Peek View Ownership Misuse) as a
glass-box telemetry event when this happens, with the calling method captured in
rawArgs[0]. The call itself is a no-op: it neither increments the count nor
returns silently. It is logged and refused.

The point of refusing is not to be punitive. It is that an unobserved
retain()-on-view bug becomes a use-after-free somewhere else, on a different
thread, at an unpredictable time. Failing fast and loudly at the misuse site
makes the bug local instead of distributed.

The Async Transfer Problem

This is the case that motivated the design. I never had to debug it in
production - I caught it on paper while sketching the ownership model, and
the design followed from there.

A request arrives. The handler reads it into a buffer. The handler then forks two
async sub-tasks under a StructuredTaskScope: one to validate, one to enrich.
Both sub-tasks need to read the same buffer. The handler joins both, then
serializes the response.

In the standard JVM model, this is a sharing problem with no good answer. If you
pass a ByteBuffer to two virtual threads, you have just created an aliasing
hazard with no concurrency model. If you copy the buffer twice, you have just
defeated zero-copy.

In the LoanedBuffer model, sharing is explicit:

try (var scope = StructuredTaskScope.open(Joiner.awaitAllSuccessfulOrThrow())) {
    try (LoanedBuffer buffer = allocator.allocate(AllocationHint.NETWORK_FRAME)) {
        buffer.retain();

        scope.fork(() -> {
            try {
                return processAsync(buffer);
            } finally {
                buffer.close();
            }
        });

        scope.join();
    }
}

The pattern is:

The allocator returns the buffer with refCount = 1.
Before forking, the parent calls retain(). Count is now 2.
The parent forks the sub-task. The sub-task runs concurrently.
The sub-task closes the buffer when done. Count drops to 1.
The parent's try-with-resources closes when the outer block exits. Count drops to 0. Slab returns to the pool.

This is dependency-safe because retain() happened before the fork. If a
caller forgets the retain(), the parent's close() can race the sub-task's
read, and the sub-task observes a slab that has already been recycled. The kernel
catches this in its TCK suite, but the contract is the caller's to honor - there
is no automatic retain on fork. I considered making scope.fork() automatically
retain the buffer if a special wrapper type was passed in, but the cost was
introducing a parallel API surface for what is fundamentally a discipline issue.
The current design keeps the rule visible at the call site: if you fork it, you
retain it first.

This is also the place where the STS-bootstrap article's pattern pays off
again. There, the structured scope owned a startup round - a bounded unit of
work with a clear lifetime. Here, the structured scope owns a fan-out unit with
the same clarity, but with an additional resource - the buffer - whose lifetime
is longer than any single fork and shorter than the enclosing scope. The
ownership model has to support that.

StructuredTaskScope does not. LoanedBuffer does.

The JMM contract underneath this is worth stating directly because it is easy
to get wrong. There are no explicit memory fences in the close-action path. The
visibility of close-action slots written by the allocating thread to the
releasing thread is guaranteed by safe publication of the buffer reference itself.
When the parent passes the buffer into scope.fork(), the JDK's structured-scope
implementation publishes the reference safely - that publication is what makes
all the buffer's fields visible to the sub-task, including the close-action
chain. If you bypassed scope.fork() and handed the buffer to another thread
through, say, a non-volatile field, the model breaks.

This is also why the Community transport's allocator uses shared-arena semantics
for all allocations rather than Arena.ofConfined(). The carrier thread
allocates, but the per-stream Virtual Thread closes - different threads, same
buffer. A confined arena would refuse the cross-thread close(). Shared arena
allows it, with the JMM safely-published buffer reference carrying visibility.

Watermarks and the Pressure Boundary

LoanedBuffer solves per-buffer ownership. It does not solve aggregate pressure.

When the slab pools start running low, the kernel needs to know - and decide
what to do about it - before an EX-MEM-1001 (Off-heap Exhausted) gets
thrown on a request hot path. That is the job of WatermarkManager.

The manager exposes four threshold levels:

Level	Off-heap utilization	`ResourceArbiter` decision
`NORMAL`	< 70%	`ALLOW` - allocations proceed
`WARNING`	70–85%	`THROTTLE` - large allocations rejected
`CRITICAL`	85–95%	`REJECT` - only essential traffic
`SHEDDING`	≥ 95%	`SHED_LOAD` - `EX-MEM-1001` thrown

The ResourceArbiter reads the current level on each allocation request:

public LoanedBuffer tryAllocate(AllocationHint hint) {
    if (watermark.isHighWatermarkBreached()) {
        throw new MemoryExhaustedException(hint.bytes(), watermark.availableBytes());
    }
    return allocator.allocate(hint);
}

This is where LoanedBuffer connects forward to the next architectural layer.
The watermark levels are not just internal accounting - they expose pressure as a
typed signal (WatermarkLevel) that the rest of the kernel - scheduling,
admission, business logic - can react to without inspecting GC counters or
parsing JFR events at runtime. How the transport edge uses that signal to shed
load before work hits the kernel is a separate decision and belongs to its own
article.

Leak Detection: When Discipline Fails

The Loan pattern relies on discipline. Every allocation must be paired with a
close(). Every retain() must be paired with another close(). There is no
GC fallback.

In production, that discipline is enforced by the API surface - try-with-resources,
sealed types, the Glass Box telemetry of EX-MEM-1003. In development and
testing, it is enforced by LeakTracker, which integrates java.lang.ref.Cleaner
to detect buffers that became unreachable without being closed.

When LeakTracker runs in PARANOID mode and observes a LoanedBuffer whose
reference count is non-zero at GC time, it emits EX-MEM-1002 (Arena Leak Detected):

Code	Meaning	Glass-Box Payload (`rawArgs`)
`EX-MEM-1001`	Off-heap Exhausted	`[0] long requestedBytes, [1] long availableBytes`
`EX-MEM-1002`	Arena Leak Detected	`[0] long segmentAddress, [1] long segmentByteSize`
`EX-MEM-1003`	Peek View Ownership Misuse	`[0] String callerMethod`

The leak is logged with the segment address and size, and a JFR event is
emitted. In a long-running test, this turns "I forgot a close somewhere" into a
specific, actionable signal with a stack trace.

This does not catch all leaks. A reference held by a long-lived data structure
will not be GC'd, and LeakTracker will not fire. The terminalStateCatalog
discipline I described in the Flow article
applies here too: long-lived in-memory caches need their own bounded retention
policy. The pattern catches forgotten references, not intentionally retained
ones.

The trade-off is honest. LeakTracker is a development and staging tool, not a
production safety net. In production, the API surface and code review are the
primary defense. In development, PARANOID mode is the difference between
"there is a leak somewhere in 50k LOC" and "the leak is in OrderHandler.java
line 142, allocated from NetworkFrameSlabPool, 4096 bytes."

What Still Remains True

A few things stay true even after this model is in place. Some of them are the
reasons not to adopt it.

ByteBuffer is still the right answer for most Java applications. If you are
building a normal HTTP service and your bottleneck is not allocation pressure
on the request hot path, the Loan Pattern is over-engineering. It costs
cognitive load on every read path, every fork, every cross-thread handoff. That
cost is justified by the constraint, not by aesthetics.

Arena is still useful for request-lifetime allocations that do not need
sharing. Inside a single Virtual Thread, with a clear scope boundary, an
Arena.ofConfined() is simpler than a refcounted buffer. The kernel uses both
patterns where each fits.

GC is still your friend for object graphs. Nothing in the Loan Pattern says
"never allocate on the heap." The pattern is specific to off-heap memory on
the request hot path. Domain objects, plan instances, log records - all of
those still live where Java has always put them.

The pattern does not solve cross-process IPC. If a buffer leaves the process

shipped over a network, written to a shared memory file, handed to a different JVM - the reference count stops being meaningful. The LoanedBuffer model is correct only for in-process lifetime. The handoff to a different process is its own boundary problem with its own ownership semantics.

Finally, the Loan Pattern does not soften the discipline cost. Every fork must
retain. Every share must retain/close. Every async path must close in finally.
The compiler will not catch the omissions for you. The code review will. The
TCK will. LeakTracker will, in development. The runtime will not.

I considered making this less explicit - a wrapper type that auto-retained on
escape from a method, an annotation that made the compiler enforce paired
close(). Each of those would have either added runtime overhead, added a
parallel API, or relied on a static analyzer that did not exist. The current
design accepts the discipline cost as the price of the architectural model.

The thing I keep coming back to is that this is not a clever data structure.
It is a contract - who owns this memory, who is allowed to extend its
lifetime, and what is observable when someone gets it wrong. The implementation
is unremarkable: a VarHandle, an int, a close-action chain. The work is
deciding that ownership belongs in the type system at all, and accepting that
the discipline cost is the price of the model.

The next architectural decision in the series is what the kernel does with the
pressure signal once it has one - how WatermarkLevel becomes a shed decision
at the network edge, and the single place in the kernel where unstructured
Virtual Threads are deliberately allowed.

Explore the Exeris Kernel - zero-allocation architecture in running code:
🔗 exeris-systems/exeris-kernel

The Memory subsystem lives in exeris-kernel-spi (MemoryAllocator,
LoanedBuffer, AllocationHint) and exeris-kernel-core
(AbstractLoanedBuffer, WatermarkManager, ResourceArbiter, LeakTracker).
If you want to see how the refcount / retain() / close() contract behaves
under cross-thread fork-and-join, the TCK suite in exeris-kernel-tck is the
fastest way in.

What you measure depends on where you draw the boundary

Arkadiusz Przychocki — Thu, 14 May 2026 09:56:13 +0000

Benchmark metadata — scenario: e2e-shop-order-saga · hardware: dev-laptop · jdk: 26 · date: 2026-05-05 · status: descriptive · reproducibility: complete

The previous article in this series — Where StructuredTaskScope Ends — argued the architectural case for building a native Flow engine instead of adopting an existing saga framework. This is the empirical sequel: what those frameworks actually cost when you measure all three saga guarantees, not just forward progress.

I ran the comparative benchmark expecting latency differences. Spring + Axon did post the fastest p95 — but that surprised me less than something else in the same table. Under matched 3% failure injection, both Axon-based stacks reported zero compensations. Not "fewer than expected." Zero. That's what made me stop and look harder.

Saga frameworks are evaluated on throughput and latency. That's
incomplete, and sometimes misleading. Saga has three guarantees:
forward progress, compensation under failure, and termination.
Most benchmarks measure forward progress only — and then call the
framework "correct" if the happy path works.

Here's what happens when you measure all three.

The setup

Three saga frameworks. One identical scenario contract. Same
payload, same VU count, same think time, same payment_fail_rate
(3% configured failure injection), same JDK 26, same dev-laptop
hardware, same wire protocol (HTTP/1.1, loopback — all three apps
were configured for h2c, but k6 negotiated HTTP/1.1 in every run;
the protocol was identical across stacks regardless).

Stack	Orchestration model
Exeris (open-core)	Native `Flow` engine — synchronous saga execution, in-process state machine
Quarkus 3 + Axon Framework + Neo4j	Async event-sourced saga via Axon Server (separate process)
Spring Boot 3 + Axon Framework + Neo4j	Same Axon model, different host runtime

Difference: orchestration model, not problem domain. Same scenario
contract on every side.

Two terms used throughout the article need their definitions made
explicit. dev-laptop here is an AMD Ryzen 5 5600 (6 cores /
12 threads), 32 GB DDR4, running Linux with a full graphical
desktop environment, with all benchmark components — k6, the
three target applications, and Axon Server where applicable —
co-located on loopback. perf-box-amd64, the next milestone
for comparative claims, is baremetal with real WAN between the
load generator and the target application. The CPU may or may not
be faster than the dev-laptop — that isn't the point. The point
of the perf-box re-run is removing the localhost shortcut and the
desktop-environment noise, and exposing the saga path to honest
network latency and operational shape.

The benchmark contract — e2e-shop-order-saga — exercises a
five-step order saga: register customer → recommend products → add
to cart → get cart → create order. Payment failure is injected
at 3% rate. Compensation must roll back stock reservation and
release the order.

Reproducibility metadata lives in the published result.json for
every run:

{
  "scenario_id": "e2e-shop-order-saga",
  "contract_id": "exeris_community_h2c_v1",
  "hardware_profile": "dev-laptop",
  "jdk_version": "26",
  "claim_scope": "exploratory",
  "reproducibility_status": "complete"
}

Every published number carries claim_scope, hardware_profile,
transport_mode, and reproducibility_status. Hard fairness
gates reject runs that fail equivalence checks. Comparative claims
require matched scenario contract on both sides — everything else
is labeled descriptive_only or exploratory. This is how every
claim below was evaluated, including my own.

→ scenario contract
→ raw artifacts

The numbers that look reasonable until you read them

Metric	Exeris	Quarkus	Spring
http_req_duration p50	4.59 ms	4.18 ms	3.10 ms
http_req_duration p95	16.6 ms	30.3 ms	14.6 ms
iteration_duration p95	9.18 s	8.49 s	9.28 s
Saga success rate	96.7%	98.2%	98.8%
Compensation rate (cfg: 3%)	3.32% ✓	0% ✗	0% ✗
Saga unresolved rate	0% ✓	1.82% ✗	1.22% ✗
App peak RSS	459 MB	752 MB	1,312 MB
App peak threads	66	94	81
App CPU time	24.7 s	22.3 s	33.3 s
Axon Server RSS (separate proc)	—	~787 MB	~850 MB
Axon Server PIDs	—	~120-126	~133-148

Look at the top of that table. Spring Boot 3 + Axon shows the
fastest p50 and p95. Quarkus and Exeris are in the same band.
By the standard mental model — the one most performance posts
are written under — Spring wins.

Now look at the bottom.

Both Axon-based stacks ran the same configured 3% payment
failure rate as Exeris. Exeris compensated 3.32% of sagas,
matching the configured rate within statistical noise. Both
Quarkus and Spring Boot reported 0% compensations.

Where did the failures go? Into "saga unresolved" — sagas that
never reached a terminal state before the test window closed.
1.82% for Quarkus, 1.22% for Spring.

This isn't a performance gap. It's a correctness gap. And the two
Axon-based stacks share it for the same reason.

What you measure depends on where you draw the boundary

Insight

The dropped compensations and the "fast" latency are not two facts.
They are one fact: a framework that returns before work is done
will always show both.

Here's what's structurally happening.

Exeris-native Flow runs the saga inline. The order endpoint
returns when the saga state machine has progressed — when a
compensation has actually executed, the response carries that
fact. The 16.6 ms p95 you see in the table measures actual saga
progress, including compensation work on the failure paths.

Spring + Axon dispatches the order command to Axon Server
asynchronously. The endpoint returns 202 Accepted as soon as the
event is published. The saga state machine continues running in
a separate process, with no one waiting for it on the request
path. The 14.6 ms p95 measures the time to publish to Axon
Server. Not the time to do the work.

Quarkus + Axon has the same architecture. Same async dispatch.
Same illusion.

This is also exactly why both Axon-based stacks drop
compensations. The saga hasn't finished by the time the k6 test
window closes. It's still running in Axon Server, async, with
no one polling. The benchmark observes "0% compensations" because
the compensation work hasn't run yet, not because no failures
occurred. The 1.2–1.8% sagas in non-terminal state are the
fingerprint.

The same boundary problem applies to every other resource on
that table:

Memory:
  Spring app 1.3 GB     +  Axon Server ~850 MB  =  ~2.16 GB
  Quarkus app 752 MB    +  Axon Server ~787 MB  =  ~1.54 GB
  Exeris all-in 459 MB                          =     459 MB

Threads:
  Spring app 81         +  Axon Server ~140    =  ~221
  Quarkus app 94        +  Axon Server ~123    =  ~217
  Exeris all-in                                 =    66

CPU time:
  Spring app 33 s       +  Axon Server ~52 s   =  ~85 s
  Quarkus app 22 s      +  Axon Server ~26 s   =  ~48 s
  Exeris all-in                                 =    25 s

Each of those metrics tells the same story. Drawing the boundary
tightly around the application process makes everything look
reasonable. Drawing it around "what does it take to run a saga
end-to-end?" — every metric is dominated by what's outside the
application process.

This is the real architectural cost of async dispatch: not just
the dropped compensations, but the second process you have to
feed CPU, RAM, threads, network, and operational attention.
Exeris-native Flow chose the in-process path. That choice is
also why it correctly compensates.

Two host runtimes, same defect

Two unrelated host runtimes (Quarkus 3 and Spring Boot 3) share
the same correctness defect. That rules out host-runtime quirks.
The shared cause is Axon's event-sourced async dispatch model
colliding with a benchmark window of finite length.

Axon's SubscribingEventProcessor (the variant used here) processes
events on a thread distinct from the dispatcher. Saga state
transitions land on that processor thread. Status reads — what k6
polls to determine "did this saga complete?" — go through Axon's
in-memory ConcurrentHashMap projection, which is updated only
after the event handler runs.

When the benchmark window ends, in-flight events that haven't
been processed are still in the queue. The saga's state never
reaches COMPLETED or COMPENSATED from the test's vantage point.
k6 sees a non-terminal state and counts it as saga_unresolved.

I want to be clear about what this is and isn't. It's not a bug
in Axon — it's the design. Async event-sourced sagas are designed
to eventually reach terminal state, given enough time. The
benchmark window — 180 seconds — is finite. The saga's
"eventually" doesn't fit in that window for some percentage of
failed sagas.

Exeris-native Flow has no such gap because the saga state machine
runs synchronously on the request VT, with off-heap state
persisted before the HTTP response. When the response says "saga
complete", the saga is complete. When the response says "saga
compensated", the compensation has executed.

You can verify this by inspecting the JFR snapshots that
accompany every benchmark run. The Exeris JFR shows compensation
events on the request VT timeline, completing within the request.
The Axon JFRs show event handler invocations that don't always
complete before the request VT exits.

Where structured concurrency helps — and where it stops

A saga isn't a fork-join problem. STS (Structured Task Scope)
gives you bounded fork-and-join, structured cancellation,
explicit error propagation. Useful primitives, but not enough on
their own:

STS does not give you state persistence between handler invocations
STS does not give you compensation queue durability across crash
STS does not give you idempotency for retry-after-restart

Exeris-native Flow combines STS for in-flight orchestration with
an off-heap flow state machine and an outbox for crash-safe
compensation. STS is half the saga story, not the whole one.

The Axon model is the inverse: it solves persistence and
durability (Axon Server holds events) but trades synchronous
correctness for async throughput. That trade is fine — until you
benchmark with finite windows and configured failure injection.

Where this still doesn't generalize

Because rigor matters more than mass distribution, here's what
the data above does and doesn't support:

Hardware: dev-laptop, not perf-box. Re-run on perf-box-amd64 is on the roadmap before any comparison_eligible claim is published.
Failure rate: configured 3% only. Extreme failure rates (10–30%) and long-tail rare failures not yet validated.
Axon variant: Axon Server architecture, not embedded EventStore. The embedded variant may behave differently.
Graph backing: Neo4j Bolt driver in path; PG-only variant not yet measured.
Window: 180s. Not enough for observation of long-tail compensation. A 1800s and longer windows would likely shift the numbers for both Axon stacks toward better compensation rates — but that's a separate experiment.

Each of these is a known scope limit, published as next milestones
in the public benchmark suite roadmap.

Reproduction

The whole thing is reproducible. The published artifacts include:

scenarios/e2e-shop-order-saga/
  scenario.json                          ← protocol contract
  comparative-pair-manifest.json         ← what runs comparable
  seed/
    seed-manifest.json                   ← deterministic seed data
    verify-seed.sh                       ← seed verification script

results/raw/e2e-shop-order-saga/
  20260505T115008Z-baseline/             ← Quarkus + Axon run
    result.json
    target-quarkus-app-axon-*.jfr
    logs/axonserver-docker-stats.csv
  20260505T115722Z-baseline/             ← Exeris run
    result.json
    target-exeris-community-app-*.jfr
  20260505T120906Z-baseline/             ← Spring + Axon run
    result.json
    target-spring-app-axon-*.jfr
    logs/axonserver-docker-stats.csv

scripts/
  run-e2e-shop-order-saga-campaign.sh    ← multi-target reproduction (full comparison)
  run-e2e-shop-order-saga-baseline.sh    ← single-target run (called by campaign)

A full reproduction of the three runs above is one invocation of
the campaign script:

./scripts/run-e2e-shop-order-saga-campaign.sh \
  --targets exeris-community-app,quarkus-app-axon,spring-app-axon \
  --graph-track neo4j \
  --repeats 1

Anyone can rerun this on their own hardware and verify the
correctness asymmetry. That's the only thing that matters for
this kind of claim.

What I trust about this, and what I don't

I trust the compensation correctness asymmetry. It reproduces.
It has a mechanical explanation. It shows up in two unrelated
host runtimes for the same Axon model. The claim is structurally
defensible: async-dispatch saga frameworks return before work is
done; that's why the latency looks fast and the compensations
go missing.

I trust the full-system footprint comparison (memory, threads,
CPU). The Axon Server process is real, has a measurable cost,
and that cost belongs in the comparison. It's not a critique of
Axon — Axon Server is doing its job. It's a critique of
benchmarks that draw the boundary tightly around the application
JVM and forget the orchestration backend.

I don't trust the raw latency numbers as comparative evidence.
They're descriptive only. p50 and p95 differences within the
3–10 ms range on dev-laptop, single-tenant loopback, are noise
relative to the 1.5–4× range that would matter for production
deployment decisions. The interesting story isn't "Exeris is
faster than Spring" — because Spring's 14.6 ms is dispatch time,
not work time, so the comparison doesn't even type-check.

The numbers I trust most are the ones I'm most afraid to publish
— because they include what's not yet validated. That's also why
they're labeled descriptive and exploratory, not
comparison_eligible.

The next milestone closes the hardware gap (perf-box-amd64
re-run) and explores wider failure rates. The data above will
either hold up or fall apart. Either is fine — that's why it's
public.

Reproducibility

Run it yourself:
github.com/exeris-systems/exeris-benchmarks
— scenario contract, raw artifacts, and reproduction scripts.
JDK 26, Docker Compose, k6. Single command, single contract.
Disagree with what I measured? Open an issue with your numbers.

The Flow engine producing the compensation correctness above is in exeris-kernel-core and exeris-kernel-spi. The TCK covering compensation under crash-recovery is in exeris-kernel-tck:
🔗 exeris-systems/exeris-kernel

Where StructuredTaskScope Ends: Building the Flow Layer in Exeris

Arkadiusz Przychocki — Thu, 07 May 2026 09:59:58 +0000

This is the fifth article in the Exeris Kernel series.

The series has been building one coherent architectural picture:
article one replaced ThreadLocal
with ScopedValue on the context propagation path;
article two argued that some layers do not benefit
from runtime polymorphism;
article three showed where
StructuredTaskScope genuinely earns its keep;
article four pushed TLS off the heap entirely.

This one is about where STS stops being the right tool — and what had to be built instead.

The Constraint I Kept Hitting

The pressure came from a specific place: I was building the orchestration layer for
Exeris Kernel — the part that drives multi-step business processes with compensation
and external event wiring.

The obvious starting point was an existing enterprise saga framework. I considered
Camunda, Axon, and several others in that category, and rejected all of them. The
flexibility the kernel's other constraints demanded — zero-allocation hot paths,
ScopedValue propagation, off-heap state — was not something I could retrofit onto a
framework whose orchestration model was already opinionated. Each also meant a separate
operational process — Axon Server, Camunda engine — with its own JVM, memory footprint,
and lifecycle. The cost-efficiency dimension I only quantified properly later, but
the operational overhead was visible from day one.

So I went one level lower. At first, StructuredTaskScope seemed like a natural fit.
You fork work, you join it, you get structured lifecycle guarantees. Project Loom made
this cheap and clean.

Then I hit a step that needed to wait for a payment confirmation.

Not for a few milliseconds. Not until a downstream service responds synchronously.
The step needed to yield execution, persist its state, and resume when an external
event arrived — possibly hours later, possibly after a JVM restart.

StructuredTaskScope has no model for this. It is structured in space — all forked
tasks are bounded to a scope that lives on a single call stack. It is not structured
in time. The moment your execution boundary needs to span time rather than scope,
you have left StructuredTaskScope territory entirely.

That is the constraint that decided the architecture.

What STS Still Does Inside Flow

Before going further: StructuredTaskScope is not replaced. It is still used in
several places throughout the kernel — and not always for what most introductions
to STS show.

The first use is the obvious one. Inside a single Flow step, when an action needs
to call two independent services and merge the results before returning
FlowOutcome.CONTINUE. The execution is bounded to that step's Virtual Thread.
STS owns the fan-out; Flow owns the step boundary. Standard structured concurrency.

The second is at L3, inside the Events subsystem. InMemoryEventBus.publishAndAwait
opens a scope, forks one Virtual Thread per registered handler, joins, and returns
once every handler has finished. CommunityEventLoop.dispatchBatch does the same
shape on a different cadence: per drained batch of events, fork one VT per registered
batch processor, join when all are done. Both are clean ad-hoc fan-outs with a
deterministic join point — exactly the case STS was designed for.

The third is the most subtle. OutboxOrchestrator.ownerLoop opens a scope and forks
exactly one task — its long-lived poll-and-flush loop. Why use STS for a single fork?
Because Java 26 enforces an owner-thread rule: open(), fork(), join(), and
close() must all happen on the same thread. The orchestrator spawns a dedicated
owner Virtual Thread specifically to satisfy that constraint, so the lifecycle scope
of the loop is explicit and structured rather than implicit and tracked across threads.

The point where Flow takes over is the inter-step boundary — the moment a step
returns something other than "I am done, continue." Below that boundary, STS
remains the right tool, in all three patterns above. Flow picks up where the
call stack ends.

The Execution Model

The execution model came from somewhere I didn't expect. Having ruled out the
enterprise saga frameworks and dropped one level lower into STS, I still needed
to decide what the saga state machine itself should look like. The first analogy
that came to mind was the TLS state machine I'd just been working on (described
in article four). TLS is one
of the most rigorously specified state machines in widely-deployed software —
states explicit, transitions deterministic, I/O bounded to state boundaries.
Saga orchestration, structurally, has the same shape. I tried imitating the TLS
pattern. It fit, and it stayed.

Each Flow instance runs on its own Virtual Thread, launched by CoreFlowRuntime:

// From CoreFlowRuntime.launch()
Thread thread = Thread.ofVirtual()
    .name("exeris-flow-"
          + instance.key().instanceIdMost()
          + '-'
          + instance.key().instanceIdLeast())
    .unstarted(() -> runInstance(instance, startStep));
runningThreads.add(thread);
thread.start();

That Virtual Thread executes steps in a loop until one of three things happens:

// From CoreFlowRuntime.applyOutcome()
return switch (outcome) {
    case CONTINUE -> applyContinueOutcome(instance, step, stepIndex);
    case COMPLETE -> {
        applyCompleteOutcome(instance, step);
        yield -1;
    }
    case PARK -> {
        applyParkOutcome(instance, stepIndex);
        yield -1;
    }
    default -> -1;
};

CONTINUE — advance to the next step, stay on the same Virtual Thread.
COMPLETE — short-circuit directly to terminal state, bypassing remaining steps.
PARK — this is the boundary. The Virtual Thread exits. State is persisted. Execution
is suspended until an explicit wake() call.

That PARK path is exactly what has no equivalent in StructuredTaskScope. When
applyParkOutcome runs, it serializes the current FlowSnapshot — step index,
compensation stack, state, timeout — and registers the instance in the parked map.
A wake() call later resolves that snapshot, either from in-memory or from the
FlowSnapshotStore, and launches a new Virtual Thread that resumes from
currentStep + 1.

Defining a Flow

Before a flow can run, it must be compiled into an execution plan. The builder API
is intentionally explicit — you declare steps, their compensation actions, and
any non-linear transitions:

FlowExecutionPlan orderFulfillment = engine.plans()
    .newDefinition("order-fulfillment")
    .step("validate-order",
          ctx -> {
              validateOrder(ctx);
              return FlowOutcome.CONTINUE;
          },
          ctx -> { /* no rollback needed for validation */ })
    .step("charge-payment",
          ctx -> {
              initiatePayment(ctx);
              return FlowOutcome.PARK; // wait for async confirmation
          },
          ctx -> refundPayment(ctx))
    .step("ship-order",
          ctx -> {
              dispatchShipment(ctx);
              return FlowOutcome.COMPLETE;
          },
          null)
    .build();

engine.plans().compile(orderFulfillment);

compile() validates the step graph, builds the adjacency matrix and the nextStep
index array, and registers the plan in the planCatalog. After this point, the engine
can schedule instances of this definition.

The trade-off here is deliberate: the definition is closed-world at compile time.
Adding or reordering steps after in-flight instances exist is a schema migration problem
— and the engine does not solve it silently. If an in-flight Saga was parked on a step
that no longer exists after a redeployment, EX-FLOW-7002 with SCHEMA_MISMATCH is
thrown on wake. The safe pattern is blue/green with Saga drain before switching traffic.
I kept this constraint explicit rather than hiding it behind version-transparent routing,
because transparent versioning that does not actually guarantee correctness is worse than
visible friction.

When the Wake Comes from Outside

Park/wake with a direct scheduler.wake() call covers the case where you control both
sides. The harder case is choreography: the flow should resume when an event arrives
from an external system — a payment gateway, a warehouse service, a user action.

This is where FlowChoreographyBridge and the sealed ChoreographyDecision
type come in. You register a mapper that translates incoming event descriptors
into routing decisions:

engine.registerChoreographyMapper(
    descriptor -> {
        // EventDescriptor carries the flow instance ID in streamIdHigh/streamIdLow.
        // The convention: the event producer encodes the target instance UUID
        // in those fields when publishing a choreography trigger.
        long instanceMost = descriptor.streamIdHigh();
        long instanceLeast = descriptor.streamIdLow();
        if (instanceMost == 0 && instanceLeast == 0) {
            return ChoreographyDecision.ignore();
        }
        return ChoreographyDecision.wake(instanceMost, instanceLeast);
    },
    List.of("payment.confirmed"),
    eventBus
);

The bridge implementation uses Java 21 pattern matching on the sealed hierarchy to
dispatch without instanceof chains:

// From FlowChoreographyBridge.handle()
switch (decision) {
    case ChoreographyDecision.Wake(long most, long least) -> {
        scheduler.lookupParked(most, least).ifPresent(scheduler::wake);
        // Stale or duplicate wake event: idempotent no-op if instance no longer parked
    }
    case ChoreographyDecision.Start(FlowExecutionPlan plan, long most, long least) -> {
        scheduler.schedule(plan, newContext(plan, most, least));
    }
    case ChoreographyDecision.Ignore() -> { /* intentional no-op */ }
}

The ChoreographyDecision sealed interface gives the compiler exhaustiveness guarantees.
Adding a new decision variant without handling it in the bridge is a compile error,
not a silent runtime miss. This is the same pattern applied to FlowOutcome in the
step execution loop — sealed types as an architectural fence.

The mapper itself is a @FunctionalInterface. It closes over whatever correlation
state you need. The bridge does not prescribe how you resolve correlation IDs —
that is your domain logic.

Events and Flow: Two Layers, One Deduplication Contract

The Events subsystem (L3) and the Flow engine (L4) have an explicit deduplication split
that is worth stating directly, because it is easy to assume one layer handles it all.

EventBus delivers at-least-once. There is no built-in deduplication at the bus level
— that is an intentional design decision. Built-in bus-level dedup requires shared state
across all subscribers: either a heap-allocated ConcurrentHashMap (GC pressure) or a
distributed lock (latency). Neither is acceptable at the performance tier the Events
subsystem targets.

IdempotencyGuard at L4 covers the Flow-specific case: step-level dedup per Saga instance.
The split is:

Layer	Deduplication	Mechanism
EventBus (L3)	None — at-least-once	Subscriber responsibility
Flow Engine (L4)	Per step, per instance	`IdempotencyGuard.tryClaimStep()`
Application	Per state mutation	`EventDescriptor.eventUuidHigh/Low` as dedup key

This means a choreography wake event can arrive twice — Outbox retry, broker reconnect,
network duplicate. The bridge calls scheduler.lookupParked() then scheduler.wake().
If the flow is no longer parked (already woken by the first delivery), lookupParked()
returns empty and the second wake is a silent no-op. If the flow wakes and re-enters
a step it already executed, IdempotencyGuard skips it and advances. Two safety nets,
neither of which requires coordination across the two layers.

The integration point between them is FlowProgressPublisher: when a flow reaches a
terminal state, it optionally publishes a FlowProgress event to the EventBus. Only
terminal transitions emit — intermediate state changes are deliberately skipped to avoid
allocation on the hot scheduling path.

Idempotency Under Replay

Crash-recovery replay and choreography re-wakes create a real deduplication problem:
a step that already executed successfully should not execute again if the engine
restarts mid-flow.

IdempotencyGuard handles this at the step level:

// From CoreFlowRuntime.runStep() — called inside synchronized(instance.monitor())
if (guard != null && !guard.tryClaimStep(
        instance.key().instanceIdMost(),
        instance.key().instanceIdLeast(),
        stepIndex)) {
    // Step already claimed — skip and advance
    int nextGuardIndex = instance.plan().nextStep(stepIndex);
    if (nextGuardIndex < 0) {
        complete(instance);
        return -1;
    }
    return nextGuardIndex;
}

The default CoreIdempotencyGuard is heap-backed: a ConcurrentMap<FlowKey, ConcurrentMap<Integer, Boolean>>.
The inner map keyed by step index makes releaseInstance() an O(1) removal of the
entire instance entry rather than O(claimed steps). On complete() or
FAILED_ROLLEDBACK, the guard releases the instance claim.

Custom IdempotencyGuard implementations — backed by Redis, a database, or
any other shared store — can be bound via KernelProviders.IDEMPOTENCY_GUARD
before submitting to the engine.

In the upcoming distributed model (ADR-013), idempotency becomes two-layered:
the in-memory heap CAS in CoreIdempotencyGuard, and the durable CAS on
FlowSnapshot.schemaVersion in JdbcFlowSnapshotStore. ADR-013 requires that
these two layers agree on terminal states — if the heap guard reports "already done"
but the durable row claims otherwise, the durable answer wins and heap state is
reconciled on next load. Neither layer can be the sole source of truth in a
distributed deployment.

This is not universal. It only covers the cases where the execution path is still
under the kernel's control. External side effects — the payment was initiated but
the response was lost — remain the caller's responsibility to make idempotent.
The guard prevents the kernel from re-executing a step it already claimed. It cannot
prevent a downstream service from seeing a duplicate call if the step succeeded before
the crash.

What Still Remains True

Compensation is powerful but it comes with a cost: every step that can be rolled back
must be written as an idempotent inverse operation. If your compensation action has
side effects that cannot be reversed — a notification already sent, an audit record
already written — you are modeling a business invariant that compensation cannot
satisfy. The engine provides the mechanism. The correctness is still yours to own.

The terminalStateCatalog in CoreFlowRuntime grows from start() until close().
It is an in-process idempotency fence that prevents re-scheduling already-terminal flows
within a single runtime lifetime. This is acceptable for the current operational model
but it is a known constraint: very long-lived runtimes with high flow volume will see
this map accumulate. Bounded retention is on the v0.7 roadmap — but any policy
must preserve the fence semantics. A terminal flow must not be re-executable.

Cross-restart choreography wake is not yet supported. After a restart, parked flows
survive only as snapshots in FlowSnapshotStore; lookupParked() currently resolves
only from the in-memory index. The v0.7 roadmap closes this via JdbcFlowSnapshotStore
with a parked-enumeration entry point — on startup the engine rehydrates wake routing
from the durable store. Steady-state choreography keeps the in-memory O(1) fast path;
the store probe is fallback-only on miss.

One constraint from that upcoming JDBC implementation is worth noting here: JdbcFlowSnapshotStore
is explicitly prohibited from using ThreadLocal for context propagation. DataSource
is constructor-injected through BootstrapContext. The same constraint that banned
ThreadLocal from the kernel's context propagation layer
runs all the way through to the distributed saga state layer.

Panama FFM does not appear in this layer directly. The allocation constraint is
enforced on the hot scheduling path via FlowZeroAllocTck, but orchestration state
lives on the heap. The off-heap ownership model from the transport layer is not the
right model for a state machine that needs to checkpoint, restore, and evolve
across JVM restarts.

One place where the zero-allocation philosophy does reach into this layer is
EventDescriptor — the routing metadata used by both the Events subsystem and
the choreography bridge. Seven primitive fields: two long pairs for the event
and stream UUIDs, two int for the ordinal and flags bitmap, one long for the
timestamp. The wire codec packs these into exactly 64 bytes — one CPU cache line,
with explicit MemoryLayout padding to fill it. C2 scalarises the record via
Escape Analysis today. When JEP 401 reaches GA, adding the value modifier
requires zero field changes — object headers disappear for free.

Conclusion

StructuredTaskScope is structured in space; Flow is structured in time. The
distinction only becomes visible at the inter-step boundary, where FlowOutcome.PARK
exits the Virtual Thread, persists state, and lets a different Virtual Thread resume
later. No STS abstraction maps to that shape. Choreography handles the case where
the wake comes from outside — sealed ChoreographyDecision and FlowChoreographyMapper
make the event-driven path type-safe and compiler-exhaustive, so adding a new decision
variant is a compile error rather than a silent runtime miss.

The trade-offs stay visible: schema migration requires drain-and-switch, compensation
correctness is the caller's responsibility, and the terminal state catalog is
runtime-scoped. None of these are problems Flow pretends to solve transparently.

The next concrete step is distributed saga state, decided in ADR-013. The model:
a shared durable FlowSnapshotStore (reference implementation: JdbcFlowSnapshotStore
over Postgres in v0.7 Sprint 3) as the source of truth, with Kafka choreography wake
for cross-service saga recovery (Sprint 5/6). Optimistic concurrency via
FlowSnapshot.schemaVersion — already in the SPI with SCHEMA_VERSION_INITIAL = 1L
— resolves concurrent advance attempts from two kernel nodes without a coordinator.

Three approaches were rejected before landing there. A distributed lock service
(ZooKeeper/etcd) puts blocking coordination on the saga advance path — inconsistent
with the No-Waste-Compute contract. CRDT-based state replication is semantically
unsound for the saga model: the compensation stack and step ordering are not
commutative, so concurrent advances cannot be safely merged at the data-structure level.
A single-leader coordinator with Raft/Paxos reintroduces the centralized stateful
component the kernel deliberately avoids.

What remains: cross-service choreography wake, parked-instance enumeration on startup,
and schemaVersion wire-through in RuntimeFlowInstance.toSnapshot(). The schema
migration constraint (drain-and-switch) is a separate problem that distributed saga
state does not soften — that requires explicit FlowSnapshot definition versioning,
which is scoped to a later milestone.

Explore the Exeris Kernel — zero-allocation architecture in running code:
🔗 exeris-systems/exeris-kernel

The Flow subsystem is in exeris-kernel-core and exeris-kernel-spi. The TCK
covering the full lifecycle — submit, park, wake, compensate, crash-recovery — is
in exeris-kernel-tck.

Your TLS Stack Is Lying to You About Zero-Copy

Arkadiusz Przychocki — Fri, 01 May 2026 17:03:13 +0000

The "No Waste Compute" Constraint

When I started designing the Exeris Kernel, I set one non-negotiable rule very early: no waste compute. That rule sounds like a performance slogan until it starts killing otherwise normal design decisions.

I had already banned ThreadLocal, moved context propagation to Scoped Values, and pushed more of the runtime into explicit off-heap ownership. The idea was simple: if the hot path is supposed to stay outside GC pressure, then memory shape and lifetime cannot be treated as incidental details.

Then I reached TLS.

Constraint upfront: this is not a universal argument against SSLEngine. For a normal Java service it is still a perfectly reasonable choice — battle-tested and deeply integrated into the ecosystem. This is about a narrower problem: what happens when TLS sits directly on the hot path of a runtime where off-heap ownership, deterministic cleanup, and zero-allocation execution are hard architectural constraints.

In most Java applications, the TLS layer is just part of the stack. It encrypts bytes, hands them off, and usually gets discussed only when certificates break or latency suddenly becomes visible in production. But in a runtime where every byte on the transport path matters, TLS is not a side concern. It is one of the defining execution boundaries. Every request passes through it. Every response passes through it. If that boundary still speaks in heap-facing contracts, then the rest of the runtime is already adapting to the wrong model.

The root issue I found with SSLEngine was not that it is slow in the abstract, old, or even mainly that it allocates. The deeper problem is that SSLEngine keeps the TLS boundary expressed in terms of JVM-managed buffer objects and heap-visible control flow, while the rest of the runtime is trying very hard to stop doing exactly that.

The Impedance Mismatch in Memory Ownership

The failure showed up at the contract level long before any benchmark.

SSLEngine is shaped around ByteBuffer. You call wrap(src, dst) and unwrap(src, dst). You get an SSLEngineResult back. You stay inside a model where the TLS boundary is expressed through JVM-owned API objects, even if some of the underlying storage uses direct memory.

That matters because I am not trying to reduce heap pressure only statistically in Exeris; I am trying to define explicit ownership all the way through the hot path. What I wanted from the boundary was strict control: the kernel owns the input memory, the kernel owns the output memory, the kernel controls the lifetime, and the kernel can release native state exactly when it decides the work is done.

What SSLEngine gives you is different. It relies on buffer exchange through a JDK object contract and state transitions expressed through JVM return objects. Its cleanup is not shaped around the same explicit ownership model as the rest of the kernel.

In a conventional stack, delayed cleanup is usually acceptable because the whole system already tolerates a lot of deferred work. In an off-heap-first runtime, "cleanup later" is not neutral. It means native TLS state can survive beyond the point where the runtime is logically done with it. Once I noticed this mismatch in ownership semantics clearly, I stopped thinking of SSLEngine as a component to tune and started seeing it as a boundary that belonged to the wrong architecture.

<img src="https://blog.arkstack.dev/blog/your-tls-stack-is-lying-about-zero-copy/fig1_tls_boundary.png" alt="Figure 1: SSLEngine memory contract vs. Exeris Arena ownership model.">

The Netty Question

I looked at Netty's OpenSslEngine directly before committing to the FFM path. It is genuinely fast and battle-tested — and for many systems it is the right answer. But it operates under a different architectural paradigm.

Netty solves the off-heap problem through pooled buffers and manual reference counting (retain() and release()). That is a powerful model, but it comes with a structural tax: the ownership semantics inevitably leak into application code, and forgetting to release a buffer creates notoriously difficult memory leaks. It is still a model bridging JVM objects and native memory through a heavy framework abstraction.

With Panama FFM in Exeris, I don't need reference counting. I get deterministic, strict ownership. Memory boundaries are tied to scopes (like Arena), meaning the lifecycle of the TLS buffer is statically guaranteed by the runtime, not dynamically managed by developers counting references. The boundary is cleaner, and the cost of maintaining it drops.

Explicit State and FFM

To see why this changes the architecture, look at the actual implementation in the Exeris Kernel.

First, I stopped letting the TLS engine silently manage its own lifecycle. In TlsStateMachine, the transitions are deterministic and tied to the kernel's execution context, not left to the garbage collector.

// Snippet from TlsStateMachine.java (Exeris Kernel)
// State transitions are explicitly modeled and bound to the off-heap lifecycle.

public void advanceState(TlsEvent event) {
    // I enforce strict state progression before any native call is made.
    // There is no ambiguous "maybe it's closed" state lingering on the heap.
    if (currentState == TlsState.HANDSHAKE && event == TlsEvent.APP_DATA) {
        throw new IllegalStateException("Cannot process application data during handshake");
    }
    // ... explicit state handling
}

Second, I mapped the actual cryptographic operation directly via Panama's FFM in OffHeapTlsEngine. Notice that I am not wrapping heap arrays. I am passing raw memory segments or delegating file descriptors directly to native OpenSSL functions.

// Snippet from OffHeapTlsEngine.java (Exeris Kernel)
// Zero-allocation FFM call: the raw off-heap address is passed directly —
// no MemorySegment wrapper object is created on the hot path.

public int writeRaw(MemorySegment sourceSegment) {
    // 1. The memory is off-heap and strictly owned by an Arena.
    // 2. Pass the raw native address (a long) directly via FFM downcall.
    try {
        long srcAddr = sourceSegment.address();
        return SSL_write(sslHandle, srcAddr, (int) sourceSegment.byteSize());
    } catch (Throwable t) {
        throw new TlsNativeException("FFM downcall to SSL_write failed", t);
    }
}

The trade-off here is explicit: I lose the safety net of ByteBuffer bounds checking and GC cleanup. In return, I gain absolute control over the data path.

What the Exploratory Benchmarks Prove

I prefer brutal transparency over carefully curated optimization claims. The native FFM TLS path in Exeris is still taking shape, but the early exploratory JMH results confirm exactly what I expected structurally.

I tested four distinct architectural models:

JDK SSLEngine: The standard heap-facing boundary (in-memory direct only).
Netty tcnative: Off-heap via JNI and reference-counted ByteBuf (embedded channel pipeline).
Exeris FFM (Memory BIO): Native TLS via Panama, where the runtime explicitly owns the memory (in-process).
Exeris FFM (FD Owner): The absolute hot path. OpenSSL is bound directly to the socket file descriptor, bypassing intermediate memory buffers entirely (write-loopback).

Architecture	Memory Boundary	Throughput	Allocation (Per 1KB Record)
JDK SSLEngine	Heap (`ByteBuffer`)	~905k ops/s	~2,528 B/op
Netty tcnative	Off-heap (`ByteBuf`)	~856k ops/s	~560 B/op
Exeris Memory BIO	Off-heap (Panama `Arena`)	~922k ops/s	0 B/op
Exeris FD Owner	Direct Socket OS boundary	~367k ops/s	0 B/op

(Methodology: JMH gc phase, Oracle JDK 26 GA, ZGC, commit f778683, 2026-05-01. The Memory BIO profile phase additionally confirmed via JFR: zero jdk.GarbageCollection events recorded — ZGC never ran a single collection during the entire benchmark run. Full suite in exeris-benchmarks.)

<img src="https://blog.arkstack.dev/blog/your-tls-stack-is-lying-about-zero-copy/fig2_fd_owner_path.png" alt="Figure 2: The data path of Memory BIO vs FD Owner directly binding to the socket descriptor.">

Let's unpack what these numbers actually mean, because context matters more than raw digits.

In a pure in-process memory test, the Exeris Memory BIO implementation outpaces both standard SSLEngine and Netty's tcnative. The runtime achieves ~922,000 ops/s without paying the structural tax of heap-facing buffer exchanges.

But the most important architectural metric is the last row: Exeris FD Owner.

A naive reading would ask why the throughput dropped to ~367,000 ops/s. The answer is that the FD Owner benchmark leaves the synthetic in-process memory arena entirely. It writes directly to the OS loopback interface via socket file descriptors. At this stage, I am no longer benchmarking memory copy operations; I am hitting the limits of the OS network stack and syscalls.

The GC Layer and the True Cost of Abstractions

What changed my mind was not the ops/s number. It was what -prof gc showed underneath it.

To process a standard 1024-byte payload, SSLEngine allocates over 2.5 Kilobytes of garbage (gc.alloc.rate.norm). The TLS layer generates more heap waste than the data it encrypts. I had already pushed the rest of the hot path off-heap — and the GC profiler was telling me the TLS boundary was quietly undoing that work on every record.

By contrast, the Exeris FFM paths drop the normalized allocation rate to strict zero. (The profiler registers ~0.01 B/op with zero actual GC counts, which is standard JMH measurement noise for absolute zero).

This is the core definition of "No Waste Compute." By eliminating the intermediate buffer tier completely, the kernel fundamentally changes the garbage collector's job. It stops doing TLS cleanup entirely. ZGC is no longer forced to clean up after the cryptography layer.

<img src="https://blog.arkstack.dev/blog/your-tls-stack-is-lying-about-zero-copy/fig3_gc_allocation_rate.png" alt="Figure 3: Allocation rate (garbage generated) per 1KB payload across different TLS architectures.">

Where SSLEngine Still Wins

A few things remain true even after this architectural shift.

First, SSLEngine is still the right answer for the vast majority of systems. If I were building a normal Spring Boot application, a Netty service, or anything where the goal was strong operational simplicity with conventional JVM trade-offs, I would not force a native TLS path into the design.

Second, direct buffers and pooling still matter. This is not an article pretending the entire existing Java ecosystem is naive.

Finally, Panama FFM and native TLS do not remove complexity—they relocate it. You get absolute control, but you also inherit absolute responsibility for lifecycle, correctness, and failure modes. This is an architectural decision for a highly specialized kernel, not a generic industry recommendation.

What I Changed, and What I Gave Up

A lot of JVM performance work still assumes the heap is the center of the system and the goal is simply to make it hurt less. That is a valid way to design software, but it is not the design I wanted for Exeris.

Once the runtime moved toward explicit off-heap ownership, SSLEngine stopped looking like a harmless standard abstraction and started looking like the one boundary that could quietly drag the whole transport path back into the wrong model.

I dropped it because for this specific runtime, it speaks the wrong language. If the hot path is supposed to be off-heap and deterministic by design, then TLS has to speak that language too.

The FFM native TLS implementation and the explicit ownership model are built entirely off-heap in the Exeris Kernel. If you want to verify the numbers, run the code, or explore zero-allocation architecture:

Review the metrics: exeris-systems/exeris-benchmarks
Inspect the Kernel and leave a star: exeris-systems/exeris-kernel

StructuredTaskScope beyond toy examples: dependency-aware kernel bootstrap in modern Java

Arkadiusz Przychocki — Tue, 07 Apr 2026 10:24:29 +0000

I did not start this because I wanted to write an article about StructuredTaskScope.

I got there from a more annoying direction: bootstrap had stopped being a startup script.

Once the kernel had a real subsystem graph — config, memory, persistence, graph, events, flow, transport — the old mental model broke down. The question was no longer "how do I start modules?" It became "what is actually allowed to start now, what must already be ready, and what happens if one piece fails halfway through?"

That is a different problem from request fan-out.

This article is a follow-up to my earlier piece on DOP, ScopedValue, and Loom. There, I used StructuredTaskScope as a clean example of native fail-fast execution. Here I want to show the more useful case: what happened once I tried to fit it into a real lifecycle model.

Constraint upfront: this only makes sense when the execution path is still under my control. If you are building a plugin surface or a highly open extension model, parts of this break down quickly.

Bootstrap stopped being linear

A lot of startup code still assumes the system is basically a list:

build some objects
call start()
maybe wait a bit
hope shutdown is the reverse

That works until the dependency graph becomes real.

In Exeris, bootstrap is constrained by subsystem relationships, not by the order I happen to like in a main() method. Some subsystems are foundational. Some are optional. Some can start only after several others are already running. Some failures can degrade. Some cannot.

At that point, startup becomes a graph problem whether you admit it or not.

What I kept from the old model was determinism.
What I dropped was the idea that everything meaningful should happen inside one generic "start all modules" phase.

The shape of the graph matters more than the urge to parallelize it.

Figure 1: Dependency-aware kernel bootstrap graph in Exeris. The point is not that several subsystems exist. The point is that concurrency is legal only where the graph permits it.

I noticed that once the graph was explicit, "just parallelize bootstrap" stopped being a serious answer pretty quickly. The graph already tells you where concurrency is allowed and where it is simply too early.

The bootstrap docs in Exeris describe the same thing from the subsystem side: L0 remains foundational, higher layers can move only after the substrate is ready, and shutdown keeps that structure in reverse.

The split that actually mattered

The design choice that mattered most was not using StructuredTaskScope.

It was deciding where not to use it.

At first, the obvious temptation was to parallelize more of bootstrap. If the JVM gives you virtual threads and structured concurrency, it is very easy to start looking for places to apply them.

I ended up doing less than that.

I kept initialize() sequential and topological.
I only allowed structured parallelism in start().

That was not ideological. It was practical.

Initialization is where the orchestrator builds structure:

provider bindings
health registration
active subsystem ordering
dependency-safe lifecycle state
bootstrap telemetry hooks

That phase wants determinism more than it wants speed. I did not want graph construction, provider composition, and lifecycle execution to collapse into one concurrent blur.

Startup is different. Once the graph is already resolved and the active set is known, concurrency becomes useful — but only if it stays inside the same lifecycle boundaries the graph already established.

That led to a much simpler rule:

initialization stays ordered, startup may become parallel.

This was the point where the old model stopped making sense for me. I was no longer trying to make bootstrap faster in the abstract. I was trying to keep lifecycle ownership readable.

for (BootstrapPhase phase : BootstrapPhase.values()) {
    List<Subsystem> forPhase = orderedSubsystems.stream()
            .filter(s -> s.phase() == phase)
            .toList();
    if (forPhase.isEmpty()) {
        continue;
    }

    if (phase == BootstrapPhase.FOUNDATION) {
        startSequential(forPhase, phase, profileName, startedNames);
    } else {
        startParallel(forPhase, phase, profileName, startedNames);
    }
}

In practice, FOUNDATION stays sequential on purpose. That includes the parts of bootstrap that decide whether the rest of the kernel can even be interpreted correctly: configuration roots, base runtime substrate, exception boundaries, and core providers.

I could have parallelized more of that. I did not.

The trade-off is deliberate:

I give up some startup parallelism early
in exchange for a cleaner substrate
and less ambiguity when the higher layers begin to move

This is not universal. If your startup graph is shallow and your layers are genuinely independent, you can be more aggressive. In my case, the real cost of a bad foundation was not a few extra milliseconds. It was a fuzzier lifecycle model and harder-to-classify failures later.

ScopedValue still mattered at the boundary

This article is about StructuredTaskScope, but I ended up reusing the same lesson from the previous piece: context propagation only stays clean if the boundary is explicit.

In Exeris, bootstrap resolves configuration once, then binds it at the kernel boundary before the rest of the lifecycle begins. Everything spawned under that boundary inherits the same immutable context.

try {
    ScopedValue.where(KernelProviders.CURRENT_CONFIG, config)
            .call(() -> {
                runBootInsideScope(orchestrator, config, configRegistry, configWatcher, kernelMain);
                return null;
            });
} catch (SubsystemCircularDependencyException ex) {
    throw ex;
} catch (SubsystemOrchestrator.BootstrapException ex) {
    throw new BootstrapException("Subsystem bootstrap failed: " + ex.getMessage(), ex);
}

That choice mattered more than another layer of constructor wiring would have.

I did not want every subsystem, handler, or virtual thread to receive config through argument threading just because bootstrap needed lifecycle scope. I also did not want to fall back to ThreadLocal and reintroduce the same inheritance and mutability problems I had already rejected elsewhere.

So the boundary stayed strict:

config is resolved once
bound once
inherited downward
and torn down when boot exits

That kept the lifecycle model cleaner. It also meant that when I later opened structured startup rounds, they inherited the same immutable runtime context without extra ceremony.

The useful part was not STS itself

The useful part was computing a safe round before opening a scope.

I do not want to smooth this into a generic explanation, because it is really the center of the design.

The orchestrator does not just fork all pending subsystems for a phase and wait.

It first computes which subsystems are actually safe to start now.

Set<String> pendingNames = pending.stream()
        .map(Subsystem::name)
        .collect(java.util.stream.Collectors.toCollection(LinkedHashSet::new));

List<Subsystem> ready = pending.stream()
        .filter(subsystem -> dependenciesReadyForRound(subsystem, pendingNames, startedNames))
        .toList();

if (ready.isEmpty()) {
    throw new BootstrapException(
            "Phase " + phase + " cannot make progress: unresolved dependencies among pending subsystems "
            + pendingNames);
}

That changed the role of StructuredTaskScope completely.

It was no longer responsible for discovering order.
It was responsible for executing one dependency-safe round inside an order the orchestrator had already made explicit.

That is why I keep saying this is a graph problem first and a concurrency problem second.

Figure 2: Dependency-safe startup round. The orchestrator computes eligibility first, then gives StructuredTaskScope a bounded unit of work to own.

This was the point where the old "just launch it and coordinate later" model stopped making sense. I did not want startup order to become an emergent property of timing, future composition, or whichever task completed first.

I wanted concurrency to appear after dependency eligibility had already been established.

This is where StructuredTaskScope actually earned its place

Once the ready set exists, the role of StructuredTaskScope becomes very narrow and very clean.

It owns one startup round.

That is it.

try (var scope = StructuredTaskScope.open()) {
    List<StructuredTaskScope.Subtask<Object>> tasks = ready.stream()
            .<StructuredTaskScope.Subtask<Object>>map(
                    subsystem -> scope.fork(() -> {
                        doStart(subsystem, phase, profile);
                        return null;
                    }))
            .toList();

    scope.join();

    List<Throwable> failures = tasks.stream()
            .filter(task -> task.state() == StructuredTaskScope.Subtask.State.FAILED)
            .map(StructuredTaskScope.Subtask::exception)
            .toList();

    if (!failures.isEmpty()) {
        Throwable first = failures.getFirst();
        throw new BootstrapException(
                failures.size() + " subsystem(s) failed in phase " + phase
                + ". First failure: " + first.getMessage(), first);
    }
}

This is the part I actually like.

Not because it is clever. Mostly because it is boring in the right way.

The round has:

an owner
explicit lifetime
explicit completion
explicit failure collection

No task belongs to some vague executor that outlives the lifecycle moment that created it. No background startup work escapes into "maybe still running" territory. The concurrency boundary finally matches the lifecycle boundary.

That was the point.

And that is also why I think StructuredTaskScope is more interesting here than in the usual "fetch two things in parallel" examples. Those examples prove the API works. This kind of orchestrator is where it starts to fit the shape of the system.

I could have done this with futures. I did not want to.

There is nothing impossible about building this with:

ExecutorService
CompletableFuture
latches
custom worker tracking
hand-rolled failure aggregation

If the goal was just "run multiple startup actions in parallel," all of those would work.

But the real problem was never just parallelism.

What I actually cared about was:

who owns this work
when exactly this round ends
what belongs to this phase and what does not
how failure is surfaced
how shutdown reasoning stays clean

Bootstrap is one of the worst places to tolerate vague concurrency ownership. When startup fails, I do not want to guess whether some task is still alive in the background or whether a future chain has already detached from the lifecycle moment that spawned it.

That is the real difference here.

The point is not that StructuredTaskScope can run tasks. The point is that it gives this round a proper boundary.

I would still use more conventional concurrency tools when the lifecycle is shallower, the ownership model is already loose, or the surrounding architecture does not benefit from such a strict boundary. This is not a universal replacement story.

The failure policy mattered at least as much as the concurrency primitive

I do not think this model would feel coherent without an explicit failure policy.

Exeris supports both:

FAIL_FAST
DEGRADE

But not symmetrically.

Foundational subsystems are still mandatory. They do not get to degrade just because higher layers can. That boundary matters.

Inside the orchestrator, that asymmetry is explicit. Optional subsystems may be removed under DEGRADE, but a mandatory failure still aborts boot.

boolean isMandatory =
        (subsystem.phase() == BootstrapPhase.FOUNDATION) || !subsystem.isOptional();

if (failurePolicy == FailurePolicy.DEGRADE && !isMandatory) {
    removeSubsystemAndTransitiveDependents(subsystem.name());
} else {
    healthMonitor.markKernelState(KernelHealthMonitor.KernelState.FAILED);
    throw new BootstrapException(
            "Subsystem '" + subsystem.name() + "' failed: "
            + failure.getMessage(), failure);
}

I kept that asymmetry because not all failures mean the same thing. An optional higher-level capability failing to start can be survivable. A foundation-layer failure usually means the system no longer has a sane substrate to run on.

This was another place where I resisted smoothing the model into something more uniform. Uniformity would have looked cleaner on paper, but it would have made the lifecycle semantics less truthful.

So the useful question was never:

can these tasks run in parallel?

It was:

what does it mean for the kernel if this one fails right now?

That question forces the architecture to stay honest.

Startup only makes sense if shutdown keeps the same shape

One thing I did not want to lose was lifecycle symmetry.

A graph-shaped startup model should not collapse into an improvised shutdown path. If startup order is derived from dependency structure, shutdown should preserve that structure in reverse.

In practice, that means I care about reverse topological shutdown just as much as startup rounds.

List<Subsystem> reversed = new ArrayList<>(orderedSubsystems);
Collections.reverse(reversed);

for (Subsystem subsystem : reversed) {
    if (subsystem.isRunning()) {
        subsystem.stop();
    }
}

That sounds obvious, but it matters more once concurrency enters the picture. Structured startup is easier to trust when the rest of the lifecycle still behaves like one coherent model instead of a collection of unrelated hooks.

Figure 3: Lifecycle symmetry in Exeris. Startup derives capability from dependency order; shutdown preserves that order in reverse so the lifecycle remains one coherent model.

I would not make reverse shutdown the headline of the article, but I would not treat it as a footnote either. It is part of the same argument: structured concurrency helps most when the surrounding lifecycle is already structured.

What I measured, and what I did not claim

This article is architectural first, but I do not want to leave it floating at the level of "this feels cleaner."

The evidence I care about here is not generic throughput. It is lifecycle evidence.

For this model, the useful signals are things like:

sequential startup vs phase-grouped structured startup
total cold boot duration until boot-ready
per-phase startup timing
round timing in parallel phases
repeated cold-start variance
degraded boot timing when optional subsystems are removed
jdk.VirtualThreadPinned events during startup
final active subsystem count recorded at boot-ready

That is why I treat JFR as part of the architecture here, not just as a performance tool. If the startup model is real, it should leave a readable lifecycle trace behind it.

The bootstrap documentation in Exeris already treats startup telemetry as part of the contract: boot-ready, shutdown completion, dependency-cycle detection, and lifecycle state are all first-class signals rather than incidental logs.

I also have some early exploratory startup measurements, although I am deliberately treating them as supporting evidence rather than a headline claim.

Metric	Exeris community h1	Quarkus JVM VT tuned
Startup → health-ready	1132 ms	2182 ms
Startup → first request	1205 ms	2432 ms
Health-ready → first request	73 ms	250 ms

These measurements were taken on dev hardware and remain sensitive to local runtime conditions, including machine state and GUI / no-GUI setup. So I am not using them to claim broad startup superiority yet.

What they do show already is narrower, but still useful: this bootstrap model is measurable in operational terms. It is not just architecturally cleaner on paper.

The smaller health-ready → first-request gap is especially interesting to me, because it suggests the lifecycle boundary is not only short on paper but also closer to usable work.

I have also validated the same runtime under a constrained exploratory profile with zero request errors, but that belongs to a different discussion than this article. The point here is narrower: the lifecycle model is observable and survives contact with measurement.

I am also deliberately keeping the claim scope narrow. Early measurements are useful, but they are still sensitive to:

classloading state
JIT state
machine noise
native load conditions
startup environment shape

So I would rather say:

this model is measurable and operationally inspectable

than jump too quickly to:

this model is definitively faster.

That comes later, if the data actually holds.

What this does not solve

This model does not fix bad subsystem boundaries.

It does not fix circular graphs.
It does not fix startup work that should not exist in the first place.
It does not mean every subsystem should suddenly become parallel.
And it definitely does not generalize to every runtime architecture.

I would still use more conventional patterns when:

the graph is shallow
the lifecycle is simpler
subsystem ownership is fuzzy by design
plugin-style openness matters more than deterministic startup shape

This is not a universal recipe. It applies when the execution path is still under my control and the lifecycle itself is part of the architecture.

That boundary matters.

What I kept, what I dropped

I think this is the part that usually gets lost when articles get too polished.

I did not end up with a universal "use STS for bootstrap" rule.

What I kept:

topological lifecycle order
deterministic initialization
explicit phase boundaries
explicit failure policy
reverse shutdown symmetry

What I dropped:

the idea that initialize() and start() should be the same phase
the idea that startup parallelism should be maximal
the idea that concurrency should appear before dependency safety is known

I also considered using StructuredTaskScope in other places where it looked fashionable on paper. In at least one case, it simply did not buy me anything meaningful, so I left it out.

That contrast was useful. It made the bootstrap use case clearer. StructuredTaskScope was not valuable because it was new. It was valuable because this part of the system already had a natural owner, a natural boundary, and a natural failure model.

Conclusion

A few things became clearer to me while building this:

Bootstrap became a graph problem before it became a concurrency problem.
That changed which part of the design actually needed structure.
StructuredTaskScope helped only after order was already explicit.
The useful move was not "fork everything" but "compute a safe round, then run it inside a bounded scope."
The trade-off is intentional.
I kept initialize() and FOUNDATION sequential on purpose. I gave up some parallelism to keep lifecycle ownership and failure semantics easier to reason about.

What this unlocks next in Exeris is not just cleaner startup code. It gives me a more inspectable lifecycle model for later work around health, telemetry, subsystem isolation, and eventually more demanding cold-start contracts.

If you want to see what this looks like outside a toy example, the bootstrap and lifecycle code is in the Exeris Kernel repository.

Explore the Exeris Kernel — zero-allocation architecture in running code:
🔗 exeris-systems/exeris-kernel

Reevaluating 1990s OOP in Java: DOP, Scoped Values, and Loom in 2026

Arkadiusz Przychocki — Sat, 21 Mar 2026 09:17:18 +0000

For decades, the Gang of Four (GoF) design patterns were the standard for object-oriented programming. If you had conditional behavior, you built a Factory. If you had interchangeable algorithms, you built a Strategy.

While these patterns remain foundational, applying them blindly in modern Java (21 through 26) often introduces an unnecessary Abstraction Tax.

When engineering the Exeris Kernel, my goal was "No Waste Compute." This didn't mean chasing micro-optimizations, but rather rethinking control flow, context propagation, and concurrency using modern JVM primitives.

Here is a pragmatic look at where the JVM is heading, and how Data-Oriented Programming (DOP) combined with Project Loom changes the architectural calculus for closed-domain systems.

The Enterprise Reality Check & Disclaimers

Before diving in, let's ground this in reality.

1. The Migration Path: You don't rewrite systems overnight.

Java 21 gives you Records, Sealed Interfaces, Pattern Matching, and Virtual Threads (GA). You can adopt DOP today.
Java 25 brings ScopedValue to GA, fixing context propagation.
Java 26+ further stabilizes StructuredTaskScope (STS). Disclaimer: As of JDK 26, STS is in its 6th preview (JEP 525). Running it in production requires --enable-preview and organizational buy-in. It is highly stable, but it is not a final standard yet.

2. The "Closed World" Constraint: The DOP approaches shown below are designed for Closed-World domains (e.g., the core business logic of a specific microservice). If you are building an Open-World system (a plugin architecture, an extensible framework, or an SPI), you must respect the Open/Closed Principle. In those cases, sealed interfaces are the wrong tool, and traditional polymorphism with Service Registries remains the correct choice.

Step 1: The Legacy Approach (Understanding the Tax)

Historically, to process different payment methods, we relied on polymorphic Strategy classes managed by a Factory. If we needed to pass context (like a Transaction ID), we relied on ThreadLocal.

public class PaymentService {
    public void process(String type) {
        PaymentStrategy strategy = PaymentFactory.get(type);
        strategy.process();
    }
}

Let's be clear: allocating a single 24-byte Strategy object per request will not kill your heap. The true "Abstraction Tax" at scale comes from the combination of indirection: cognitive overhead, polymorphic dispatch costs on extreme hot-paths, deep object graph complexity, and crucially, the severe memory overhead of copying InheritableThreadLocal maps when spawning thousands of Virtual Threads.

Step 2: Data-Oriented Programming

In modern Java, we can separate data from behavior. Instead of a polymorphic Factory returning Strategy objects, we use Sealed Interfaces and Records to model our domain, and Pattern Matching for dispatch.

In DOP, records must guarantee the validity of their state at creation using Compact Constructors.

import java.math.BigDecimal;

// 1. A closed hierarchy. The compiler ensures exhaustiveness.
public sealed interface PaymentMethod permits CreditCard, Blik {}

// 2. Compact Constructors prevent invalid state
public record CreditCard(String cardNumber) implements PaymentMethod {
    public CreditCard {
        if (cardNumber == null || !cardNumber.matches("\\d{16}")) {
            throw new IllegalArgumentException("Invalid card format");
        }
    }
    public String getLastFourDigits() { return cardNumber.substring(12); }
}

public record Blik(String code) implements PaymentMethod {
    public Blik {
        if (code == null || !code.matches("\\d{6}")) {
            throw new IllegalArgumentException("Invalid BLIK code");
        }
    }
}

For closed-domain dispatch, this eliminates the Factory entirely. You get compile-time exhaustiveness, guaranteed data validity, and more predictable control flow.

The Memory Footprint Shift

Step 3: Scoped Values (Context Without the Leak)

If nested logic needs a Transaction ID for logging, InheritableThreadLocal becomes a massive bottleneck. Copying state across millions of Virtual Threads destroys the lightweight nature of Project Loom.

In JDK 25, Scoped Values are GA. ScopedValue provides immutable, downward-only data flow. It drastically reduces inheritance overhead and automatically vanishes when the execution scope exits.

public class PaymentContext {
    public static final ScopedValue<String> TX_ID = ScopedValue.newInstance();
}

// Binding the context
ScopedValue.where(PaymentContext.TX_ID, "TX-9981").run(() -> {
    System.out.println("Processing ID: " + PaymentContext.TX_ID.get());
});

(Caveat: Because ScopedValues are strictly downward, patterns like updating MDC context deep in the call stack require architectural adjustments).

Step 4: Execution (Pure DOP + Structured Concurrency)

We have fixed data modeling (DOP) and state propagation (ScopedValue). Now we need execution.

If we execute a payment while simultaneously calling a fraud service, and the fraud service fails, the payment must abort instantly (Fail-Fast).

In 2026, many still use libraries like Resilience4j for this. To be clear: StructuredTaskScope does not replace retries or circuit breakers (which belong in your Service Mesh/Envoy layer). However, STS natively replaces application-layer bulkheads and timeouts.

Here is the implementation using JDK 26 Preview API. We fork virtual threads and pass validated data records directly to a function.

import java.math.BigDecimal;
import java.util.UUID;
import java.util.concurrent.StructuredTaskScope;
import java.util.concurrent.StructuredTaskScope.Subtask;

public record PaymentRequest(BigDecimal amount, PaymentMethod method) {}

public class PaymentOrchestrator {

    public void handle(PaymentRequest request) {
        String txId = UUID.randomUUID().toString().substring(0, 8);
        ScopedValue.where(PaymentContext.TX_ID, txId).run(() -> {
            executeConcurrentWorkflow(request);
        });
    }

    private void executeConcurrentWorkflow(PaymentRequest request) {
        // Enforces a strict concurrency model with ownership constraints
        try (var scope = StructuredTaskScope.open()) {

            Subtask<String> paymentTask = scope.fork(() -> executePayment(request.method(), request.amount()));
            Subtask<Boolean> fraudTask = scope.fork(() -> performFraudCheck(request));

            // Automatically interrupts sibling tasks if one fails
            scope.join();

            System.out.println("Payment: " + paymentTask.get());
            System.out.println("Fraud Check: " + (fraudTask.get() ? "Clear" : "Suspicious"));

        } catch (StructuredTaskScope.FailedException e) {
            System.err.println("Transaction aborted: " + e.getCause().getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private String executePayment(PaymentMethod method, BigDecimal amount) throws InterruptedException {
        System.out.println("[Tx: " + PaymentContext.TX_ID.get() + "] Executing...");
        Thread.sleep(500); // Simulate I/O

        return switch (method) {
            case CreditCard c -> "Processing card ending in " + c.getLastFourDigits();
            case Blik b       -> "Authorizing BLIK code: " + b.code();
        };
    }

    private boolean performFraudCheck(PaymentRequest request) throws InterruptedException {
        Thread.sleep(300);
        return true;
    }
}

The Native Fail-Fast Architecture

The Pragmatic Verdict

What replaces traditional GoF patterns in closed-domain systems is not a new pattern, but a shift in native JVM primitives: data (Records), context (ScopedValue), and execution (StructuredTaskScope). By combining these primitives, we achieve:

Reduced Indirection: We pass strictly validated records to functions running on virtual threads, bypassing proxy layers and improving predictability.
Data Safety: Compact Constructors ensure invalid state never enters the pipeline.
Native Fail-Fast: StructuredTaskScope natively handles cancellation propagation across thread boundaries.

System design in 2026 isn't about entirely removing OOP or chasing zero-allocation myths. It is about understanding which infrastructure problems are now solved natively by the JVM and your Service Mesh—and pragmatically removing the workarounds we used to rely on.

If you want to see how this model scales beyond simple request handling into a durable, off-heap runtime, take a look at the Exeris Kernel repository on GitHub.

Why I Banned ThreadLocal from the Exeris Kernel (And What Replaced It)

Arkadiusz Przychocki — Fri, 06 Mar 2026 18:43:10 +0000

When I started designing the Exeris Kernel — a next-generation, zero-copy runtime built for Java 26+ — I established one non-negotiable architectural law: "No Waste Compute."

In a system designed to handle extreme density by mapping exactly one Virtual Thread to every network stream (1-VT-per-Stream), every byte of memory and every CPU cycle must be intentional.

But very quickly, I hit a legacy wall.

In the standard Enterprise Java ecosystem, when you need to pass a SecurityContext, a TenantId, or a TransactionID down to the database layer without polluting dozens of method signatures, you reach for a trusted tool: ThreadLocal. For over two decades, ThreadLocal was the backbone of Java framework magic. But in the era of Project Loom (JEP 444) and Structured Concurrency, this old friend becomes a performance serial killer.

Here is why I enforced a strict, kernel-wide ban on ThreadLocal in Exeris, and how adopting JEP 506 (Scoped Values) completely changed the game for high-performance architecture.

The Forensic Analysis: The 3 Sins of ThreadLocal

Treating Virtual Threads like OS threads discards most of their scalability advantages — especially around context propagation and allocation behavior. When you combine ThreadLocal with a highly concurrent, thread-per-request architecture, you introduce three critical flaws:

1. The Spaghetti State (Unconstrained Mutability)

Any code deep in the call stack that can read a ThreadLocal can also call .set() on it. If a nested library mutates the SecurityContext mid-flight, tracking down who changed it and when is a debugging nightmare. Data flow becomes completely unpredictable.

Figure 1: The uncontrolled mutability of ThreadLocal versus the strict, read-only data flow guarantees of a lexically bounded ScopedValue.

2. The Memory Leak Trap (Unbounded Lifetime)

A ThreadLocal survives until the thread dies or someone explicitly calls .remove(). In legacy thread pools, forgetting to clean up means a security context bleeds into the next user's request.

3. The Inheritance Tax (The RAM Killer)

This is the fatal blow. To share context with child threads, frameworks use InheritableThreadLocal. When a parent thread creates a child, the JVM must eagerly clone the parent's ThreadLocalMap. This typically allocates between 32 and 128 bytes per entry on the heap, depending on the load factor and key distribution.

Now, imagine a single HTTP request where your logic forks 50 concurrent sub-tasks (Virtual Threads) to fetch data. You just triggered 50 expensive map allocations. Multiply that by 10,000 concurrent requests, and your Garbage Collector stalls your application just to clean up useless context clones. This becomes a pure GC tax with no business value.

Figure 2: The O(N) memory copy penalty of InheritableThreadLocal compared to the O(1) constant-time pointer inheritance introduced in JEP 506.

The Missing Link: Structured Concurrency Incompatibility

Beyond performance, ThreadLocal is fundamentally incompatible with Structured Concurrency. StructuredTaskScope relies on deterministic, tree-like execution where child tasks are strictly bound to the lifetime of their parent. ThreadLocal, being non-deterministic and fully mutable at any level of the tree, completely breaks this model.

You cannot build a reliable, fail-fast concurrent tree if any leaf node can secretly mutate the global state of the branch.

Exhibit A: The Zero-Waste Solution (JEP 506)

To survive millions of Virtual Threads, we need a mechanism that is immutable, temporally bounded, and virtually free to inherit. Enter Scoped Values.

Instead of a globally mutable variable, a ScopedValue defines a Dynamic Scope. It binds a value to a specific block of code (and all methods called within it). Once the block finishes, the binding vanishes.

The Scoreboard

	ThreadLocal	ScopedValue
Immutability	Mutable (Anyone can overwrite)	Immutable (Read-only for callees)
Lifetime	Unbounded (Requires manual cleanup)	Lexically bounded (tied to the `.run()` block)
Inheritance Cost	O(N) memory copy	O(1) constant-time inheritance with negligible allocation cost

Exhibit B: "Show, Don't Tell" — The Exeris Implementation

In the Exeris Kernel, context propagation is strictly separated. The Security module authenticates, and the Persistence module applies Row-Level Security. They never talk directly. They communicate purely through an "Invisible Wall" using ScopedValue.

Figure 3: Context propagation in the Exeris Kernel. Security and Persistence modules remain completely decoupled, sharing identity strictly through an immutable dynamic scope.

Here is how identity is injected at the gateway. Notice the complete absence of .set() methods:

// 1. Decode token directly from off-heap memory (Zero-Alloc)
AuthenticationResult result = securityProvider.authenticate(tokenBuffer);

// 2. Open a lexically bounded, immutable Dynamic Scope
// Note: Chained .where() calls create efficient nested scopes.
ScopedValue
    .where(KernelProviders.PRINCIPAL_CONTEXT, result.principal())
    .where(KernelProviders.STORAGE_CONTEXT,   result.storage())
    .run(() -> {
        // Inside this block, the context is safe.
        // It will be inherited by any Virtual Thread spawned via StructuredTaskScope.
        dispatchRequest(request);
    });

// 3. Scope closes automatically. No .remove() needed. Zero leaks.

Later, deep in the Persistence module, the TransactionOrchestrator needs to know the Tenant ID to append it to the SQL query. It simply queries the active scope:

public class TransactionOrchestrator {

    private static StorageContext resolveStorageContext() {
        // Zero ThreadLocal, fully Virtual-Thread safe (JEP 506)
        // isBound() is an O(1) check
        if (KernelProviders.STORAGE_CONTEXT.isBound()) {
            return KernelProviders.STORAGE_CONTEXT.get();
        }
        // Fallback to system context without allocating objects
        return ImmutableStorageContext.system();
    }

    // ... transaction execution logic
}

Because ScopedValue is immutable, the TransactionOrchestrator is guaranteed by lexical scoping and immutability that the StorageContext it reads is exactly the one set by the gateway, untampered by any interceptor along the way.

The Paradigm Shift

By ripping ThreadLocal out of the kernel, we eliminated an entire category of memory leaks and GC pressure. When a system spawns 1,000,000 Virtual Threads, the difference between "copying a map 1 million times" and "sharing a pointer in constant time" is the difference between a crashed server and a stable infrastructure.

Java 26 is not just "Java 8 with var". Features like Project Loom, Panama (FFM), and Scoped Values require a fundamental shift in how we architect systems. If we keep building frameworks using patterns from 2014, we will never unlock the true performance of modern hardware.

Would you be willing to refactor your application to drop ThreadLocal and embrace ScopedValue? Let me know in the comments.

Explore the Exeris Kernel

The zero-allocation architecture described in this article isn't just theory — it's running code. Exeris is an open-core, post-container cloud kernel built for extreme density. If you're tired of GC pauses and want to see how native I/O, Panama FFM, and Virtual Thread orchestration look in practice, explore the Exeris Kernel:

🔗 GitHub Repository: exeris-systems/exeris-kernel

Welcome to Arkstack — JVM Performance, Off-heap Memory & Low-Latency Architecture

Arkadiusz Przychocki — Fri, 06 Mar 2026 18:36:07 +0000

Welcome to Arkstack.dev — a minimalistic, developer-focused blog with zero unnecessary fluff.

This is the digital home of Arkadiusz Przychocki, Lead Cloud Architect & Full-Stack Engineer. If you care about clean systems, modern JVM engineering, and cloud-native architecture done right, you're in the right place.

What to Expect

This blog covers the intersection of:

Cloud Architecture — designing resilient, cost-efficient systems on AWS, GCP, and Azure. Infrastructure as code, distributed systems, and the hard lessons learned in production.
Modern Java — the language is evolving faster than ever. Virtual threads (Project Loom), sealed classes, pattern matching, and the growing ecosystem around GraalVM native image compilation.
Zero-Waste Engineering — every abstraction has a cost. We care deeply about what actually runs on the metal.

Exeris Kernel

One of the flagship projects I'm working on is Exeris Kernel — a zero-allocation runtime designed for ultra-low-latency workloads. The core philosophy:

Every allocation is a liability. Every system call is a negotiation. We opt out of both wherever possible.

Exeris Kernel is built on top of the JVM's Foreign Function & Memory API, leveraging off-heap memory management to achieve predictable, GC-pause-free execution. It's designed for financial systems, real-time data pipelines, and anywhere nanoseconds matter.

// Example: off-heap buffer allocation in Exeris Kernel
var arena = Arena.ofConfined();
var segment = arena.allocate(ValueLayout.JAVA_LONG, 1024);
segment.setAtIndex(ValueLayout.JAVA_LONG, 0, System.nanoTime());

Calendar vs. Competence

I don't measure my career in calendar years. I measure it in the depth of the stack I've had to dismantle.

While many spend a decade inside the comfort zone of a framework, I spent the last 6 months fighting the JVM Garbage Collector and manual memory management — off-heap arenas, direct MemorySegment access via Project Panama, and the hard lessons of building a zero-allocation runtime from scratch.

That's where real engineering begins. Not in CRUD scaffolding. In the moments when the GC log is your only clue, and latency spikes are measured in microseconds, not seconds.

This blog is the audit trail of that journey.

The Philosophy

The philosophy of this blog mirrors the philosophy of Exeris: No Waste.

No unnecessary client-side JavaScript on this site
No bloated frameworks where a pure function suffices
No ceremony when simplicity delivers the same result

Built with Astro — the framework that ships zero JS by default.

Stay Tuned

Upcoming content includes deep dives into:

GraalVM native image compilation for Spring Boot services
Designing multi-region active-active architectures
The Exeris Kernel internals: lock-free queues and off-heap buffers

Let's build things that last.