Juan Torchia

Posted on May 28 • Originally published at juanchi.dev

The benchmark that made me change my mind about Jakarta EE in 2026

#english #performance #backend #railway

The first table made me uncomfortable: on my machine, with the lab’s realistic workload, Embedded GlassFish seemed to beat Spring Boot. If I had published then, the post would have had more punch, but it would also have been methodologically weak. I paused and added what was missing: the same supported JDK for all three runtimes, a separate warmup, longer measured windows, a fixed heap, explicit pool settings, and pg_stat_statements to attribute database cost. With that, the conclusion changed.

This post does not try to decide who "wins forever." It tells how my reading changed once the benchmark became fair, and why I land on three different answers. If I start a greenfield project today with a team that already lives in Spring, I still choose Spring Boot. If I’m in an organization that already runs Payara/Jakarta, I try Payara Micro. And if there’s Jakarta code that wants a light executable, Embedded GlassFish enters the conversation.

Why I ran this experiment

I had an easy-to-repeat idea: "Spring Boot is always the obvious choice." I wanted to challenge it with evidence, not intuition.
I’m interested in modern Jakarta EE without nostalgia. I wanted to see if there’s real room today, not in 2012.
I avoided Hello World. I built a small but realistic, DB-heavy API with reads, writes, aggregations, and mixed load.
The editorial goal is simple: decide with defensible measurements, not folklore.

The system I implemented

The lab’s domain was shipment-intelligence: the same API served on three runtimes (Spring Boot, Embedded GlassFish, and Payara Micro).

PostgreSQL with a large deterministic dataset (100k shipments).
Tracking read by trackingId.
Operational summaries (route and volumes).
Paginated delayed shipments.
Real event ingestion into the database.
Health/readiness.
k6 as the load generator with shared scenarios.
Measurement of RSS, GC logs, runtime stdout/stderr, and pg_stat_statements to understand DB cost.

I don’t show the lab’s code here. What matters for this post is that all three versions implement the same HTTP contract and point to the same database, with the same k6 scenarios.

How the conclusion changed as I improved the methodology

The narrative turn in this lab is explained with two snapshots: Phase 2 and Phase 4. The first is the temptation to publish quickly. The second is when the experiment becomes defensible.

Phase 2 table (realistic operational benchmark, 3 runs per runtime)

Runtime	Median p50	Median p95	Median p99	Median throughput
Embedded GlassFish	4.66 ms	58.77 ms	111.85 ms	86.46 req/s
Payara Micro	16.32 ms	135.76 ms	238.61 ms	71.17 req/s
Spring Boot	36.59 ms	340.50 ms	594.74 ms	53.36 req/s

Honest reading of Phase 2: if I had stopped there, the easy headline would have been "GlassFish is back." But too much was missing: there was no pg_stat_statements, no RSS capture per run, and no separate warmup; samples were short, the JDK wasn’t uniform, and pool settings weren’t declared the same way for every runtime. It was a good base to continue from, not to close the topic.

Phase 3 added causality and complexity (VU 10/25/50/100, three runs per combination, DB attribution, RSS before/after, GC logs). GlassFish stayed strong in tail latency at higher VUs, Payara fought for throughput, Spring Boot held with lower RSS. But a key warning appeared: in some runs Payara complained about an unsupported JDK. I needed a fairer phase.

Phase 4: the fair benchmark (basis of the post)

Here is the snapshot that matters for telling the story. Controls:

Temurin 21.0.10 for all.
Fixed heap: -Xms512m -Xmx512m.
Separate warmup and a 180s measured window.
Explicit pool settings.
pg_stat_statements reset after warmup.
Three runs per runtime/VU, with VUs 25 and 100.
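To make the aggregation behind the tables concrete, here is a minimal sketch of how a per-run percentile and a median across three runs can be computed once warmup samples are discarded. It is not the lab's harness: the class name, the 30 s warmup, and every sample value are hypothetical, and the nearest-rank percentile used here is one common definition that may differ from how k6 computes its own percentiles.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: names and sample values are hypothetical,
// not taken from the lab's harness.
class RunAggregation {

    // One request observation: seconds since the run started, and its latency.
    record Sample(double tSeconds, double latencyMs) {}

    // Keep only latencies observed inside [warmup, warmup + window).
    static double[] measuredLatencies(List<Sample> samples,
                                      double warmupSeconds,
                                      double windowSeconds) {
        return samples.stream()
                .filter(s -> s.tSeconds() >= warmupSeconds
                          && s.tSeconds() < warmupSeconds + windowSeconds)
                .mapToDouble(Sample::latencyMs)
                .toArray();
    }

    // Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
    static double percentile(double[] values, double p) {
        if (values.length == 0) throw new IllegalArgumentException("no samples");
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Median of per-run values (for three runs, the middle one).
    static double median(double[] values) {
        if (values.length == 0) throw new IllegalArgumentException("no runs");
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Synthetic samples: the first two fall inside a 30 s warmup and are dropped,
        // and the last one falls after the 180 s measured window.
        List<Sample> run = List.of(
                new Sample(5, 900), new Sample(20, 400),
                new Sample(31, 10), new Sample(60, 20), new Sample(90, 30),
                new Sample(150, 40), new Sample(209, 50), new Sample(211, 70));
        double[] measured = measuredLatencies(run, 30, 180);
        System.out.println("p50 = " + percentile(measured, 50));
        System.out.println("p95 = " + percentile(measured, 95));

        // Hypothetical per-run p95 values for three runs of one runtime/VU cell.
        System.out.println("median p95 = " + median(new double[] {70.0, 64.0, 68.0}));
    }
}
```

The point of the sketch is the order of operations: drop warmup first, compute percentiles per run, and only then take the median across runs, so a single noisy run cannot move the reported cell much.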

Main Phase 4 table

Runtime	VUs	Runs	Median p50	Median p95	Median p99	Median throughput	Worst error rate	Check failures	Median RSS before
Spring Boot	25	3	4.59 ms	66.92 ms	110.03 ms	213.13 req/s	0.01%	2	517.5 MB
Payara Micro	25	3	33.10 ms	188.16 ms	336.77 ms	156.48 req/s	0.00%	0	694.3 MB
Embedded GlassFish	25	3	38.03 ms	198.83 ms	371.96 ms	151.26 req/s	0.00%	0	579.1 MB
Spring Boot	100	3	149.36 ms	341.69 ms	473.41 ms	372.56 req/s	0.04%	25	543.0 MB
Payara Micro	100	3	204.61 ms	588.31 ms	870.53 ms	284.29 req/s	0.00%	0	715.7 MB
Embedded GlassFish	100	3	320.12 ms	540.00 ms	677.23 ms	229.28 req/s	0.01%	5	593.9 MB

Editorial read of Phase 4 (limited to this local workload and my machine):

At 25 VUs, Spring Boot was clearly ahead in median latency and throughput, and it had the lowest median RSS of the three (517.5 MB) under the same fixed heap.
At 100 VUs, Spring Boot also had better p95/p99 and median throughput. The cost was check failures: 25 in the 100 VU set and 2 at 25 VUs. I’m not hiding it because it also speaks to the system under pressure.
Payara Micro was the "cleanest" Jakarta EE runtime by check failures in Phase 4: 0 at 25 and 0 at 100 VUs. In throughput it came second and it had the lowest p50 in the Jakarta group at 100 VUs, although its p95/p99 at 100 VUs were higher than GlassFish’s and it had the highest RSS of the three.
Embedded GlassFish remained viable and technically interesting, but stopped leading when the method became stricter.
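For a sense of scale, the throughput gaps in the Phase 4 table can be turned into ratios. The sketch below only does arithmetic on the published medians; the class and method names are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Arithmetic on the Phase 4 table; class and method names are illustrative.
class Phase4Ratios {

    // How many times larger a is than b.
    static double ratio(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        // Median throughput (req/s) from the Phase 4 table: {25 VUs, 100 VUs}.
        Map<String, double[]> throughput = new LinkedHashMap<>();
        throughput.put("Spring Boot", new double[] {213.13, 372.56});
        throughput.put("Payara Micro", new double[] {156.48, 284.29});
        throughput.put("Embedded GlassFish", new double[] {151.26, 229.28});

        double[] spring = throughput.get("Spring Boot");
        for (Map.Entry<String, double[]> e : throughput.entrySet()) {
            if (e.getKey().equals("Spring Boot")) continue;
            double[] other = e.getValue();
            System.out.printf(Locale.ROOT,
                    "Spring Boot vs %s: %.2fx at 25 VUs, %.2fx at 100 VUs%n",
                    e.getKey(),
                    ratio(spring[0], other[0]),
                    ratio(spring[1], other[1]));
        }
    }
}
```

On these medians, Spring Boot's throughput is roughly 1.36x Payara Micro's at 25 VUs and 1.31x at 100 VUs, and roughly 1.41x Embedded GlassFish's at 25 VUs and 1.62x at 100 VUs. These are ratios of three-run medians on one machine, so they describe this lab and nothing more general.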

How the database explained part of the story

With pg_stat_statements it was clear this lab is DB-heavy. The analytical aggregations (route/volumes) dominated the latency tail under pressure. Tracking read, in contrast, was cheap. That doesn’t prove the difference comes "only from the runtime." It shows the comparison happens in a system where PostgreSQL, the pool, JDBC, k6 in Docker, and the host also matter. That is the level of attribution I want to see before slapping on a headline.
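As a rough illustration of that kind of attribution, the sketch below resets pg_stat_statements and later lists the statements with the highest total execution time over plain JDBC. It is not the lab's tooling. It assumes PostgreSQL 13 or newer (where the columns are named total_exec_time and mean_exec_time), the pg_stat_statements extension already enabled, a role allowed to call the reset function, and a PostgreSQL JDBC driver on the classpath. LAB_JDBC_URL is a hypothetical environment variable.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Locale;

// Sketch only. Assumes PostgreSQL 13+ column names and a JDBC driver on the classpath;
// LAB_JDBC_URL is a hypothetical environment variable, not part of the lab.
class DbAttribution {

    // Clears accumulated statistics, e.g. right after warmup.
    static final String RESET_SQL = "SELECT pg_stat_statements_reset()";

    // Statements ranked by the total time the database spent executing them.
    static final String TOP_SQL =
            "SELECT query, calls, total_exec_time, mean_exec_time "
          + "FROM pg_stat_statements "
          + "ORDER BY total_exec_time DESC "
          + "LIMIT ?";

    static void printTopStatements(Connection conn, int limit) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(TOP_SQL)) {
            ps.setInt(1, limit);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf(Locale.ROOT,
                            "%10.1f ms total | %8d calls | %8.2f ms mean | %s%n",
                            rs.getDouble("total_exec_time"),
                            rs.getLong("calls"),
                            rs.getDouble("mean_exec_time"),
                            rs.getString("query"));
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        String url = System.getenv("LAB_JDBC_URL");
        if (url == null) {
            System.out.println("Set LAB_JDBC_URL to run against a database.");
            return;
        }
        try (Connection conn = DriverManager.getConnection(url)) {
            if (args.length > 0 && args[0].equals("reset")) {
                // Run with "reset" once warmup has finished.
                try (Statement st = conn.createStatement()) {
                    st.execute(RESET_SQL);
                }
            } else {
                // Run without arguments after the measured window.
                printTopStatements(conn, 5);
            }
        }
    }
}
```

Resetting after warmup means the ranking reflects only the measured window, which is what lets a statement like a heavy aggregation be tied to the latency tail rather than to startup noise.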

The developer experience (brief and honest)

Spring Boot was the fastest to iterate. It’s not an absolute merit of the framework; it’s the reality of a small team that already lives there. Config, packaging, health/readiness, and observability landed almost without thinking.
Payara Micro felt pragmatic if there is already a WAR/Jakarta culture. In Phase 4 runs it was impeccable on check failures. It required more log interpretation and runtime details.
Embedded GlassFish was the surprise. It got me to a lighter Jakarta EE executable than I expected. It didn’t win the final phase, but it made me revisit my biases.

Mini evolution map (from "it seems like" to "fair conclusion")

Phase 2: GlassFish seemed to be the winner of the realistic workload.
Phase 3: GlassFish strong in tail latency, Spring with lower RSS, Payara competitive; Payara reported an unsupported JDK in some runs.
Phase 4: with Temurin 21, fixed heap, warmup, and long windows, Spring Boot ended up with the best local latency/throughput profile; Payara Micro with zero check failures was the cleanest Jakarta EE; GlassFish remained viable.
Phase 5: external smoke on Railway, useful for portability, not for performance.

Railway as smoke, not as a podium

On 2026-05-25 I reproduced a smoke test on Railway: all three runtimes deployed against a disposable PostgreSQL and passed /ready, the tracking read, and a minimal k6 run (1 VU / 10s) without check failures. That’s enough for me to say "this moves outside my machine" and it matches how I’ve been operating juanchi.dev on Railway. I don’t use it to infer production performance.

Phase 2 vs Phase 4 table (what changed when the benchmark was fair)

Phase	Quick read	What was missing or added	Who ended up better positioned
Phase 2	GlassFish seemed to lead in p95/throughput	No pg_stat_statements, no separate warmup, short samples, non-uniform JDK, pools not explicit	GlassFish (apparent), but with incomplete method
Phase 4	Spring Boot led in latency and throughput at 25 and 100 VUs	Added everything that was missing: supported JDK, fixed heap, warmup, 180s window, explicit pools, DB attribution	Spring Boot in local latency/throughput; Payara Micro with zero check failures; GlassFish viable

Decision tree (what I take into practice)

Greenfield with a team that already knows Spring: Spring Boot. Reasons: lower adoption friction, ecosystem, observability, hiring, and in this lab the better local Phase 4 profile.
Organization with Payara/Jakarta/WAR already in place: try Payara Micro before proposing a migration. In the lab it was the cleanest Jakarta EE under pressure (zero check failures) and competitive in throughput.
Jakarta code seeking a lighter executable and not needing a full app server: evaluate Embedded GlassFish. It’s more viable than many think and can be the bridge without a full rewrite.
Migration discussion driven by performance: run your own benchmark with that system’s real workload. A post is not enough (not even this one).
If the decision is dominated by operability, integrations, and hiring: Spring Boot tends to reduce risk for teams like mine.

Honest limits (not to blow smoke)

A single workstation for Phase 4.
DB-heavy workload; does not isolate pure runtime.
No Kafka, PostGIS, native image, Kubernetes, or autoscaling.
No long soak test.
Phase 5 is external smoke, not a performance matrix.
The developer experience is biased by prior familiarity with Spring Boot.
Jakarta runtimes’ logs require interpretation and that should be stated, not hidden.
Spring Boot’s check failures in Phase 4 are preserved and mentioned; the exact root cause was not fully proven in that session.

What I would change if I repeat this lab tomorrow

Even longer pressure runs (and a multi-hour soak) to capture slow variation.
Replication on another machine or a CI runner to eliminate local noise.
Full capture of the k6 console output and runtime stdout/stderr, automated in the harness.
One more step toward equal pool tuning (Hikari everywhere with the same fine-grained policies) and PostgreSQL connection limits to see if the queue moves.
A version with a more CPU-bound workload (fewer heavy aggregations) to isolate runtime/serializer.

How this fits with my current work

Day to day I build Java/Spring Boot backends in a small team that delivers digital identity, biometrics, signing, and storage. There’s a lot of pressure to ship and to operate with confidence. That’s why, although modern Jakarta EE feels viable to me (and after this lab it feels even more so), for greenfield I choose Spring Boot. The marginal cost of becoming productive and the operational clarity still matter. At the same time, if I arrive at a client with Payara in production and stable WARs, today I have evidence to say "let’s try Payara Micro and/or Embedded GlassFish before planning a full rewrite."

What truly surprised me (the eureka moment)

The eureka was when I saw that, with Temurin 21 for all, a fixed heap, and a serious warmup, the ranking changed. It wasn’t that Spring "magically became faster"; it was that the comparison finally had controls. And the dominant factor in p99 under pressure was in the database, not in some if statement inside the framework. From there, the debate stops being religious and becomes architectural: what am I really measuring? What do I want to optimize? Which trade-off suits this team?

What the briefs changed in this post

This post did not come from a single generation or a pretty table. I treated it as an editorial package: first I built the experiment, then I wrote briefs to separate evidence, allowed claims, forbidden claims, and limits. That changed the final text quite a bit.

The briefs forced me to stop three times:

Do not publish Phase 2 as the truth, even though it had more punch, because the fairness controls were still incomplete.
Do not hide Spring Boot's check failures in Phase 4: if they are in the evidence, they need to be in the post.
Do not sell Railway as a production benchmark: Phase 5 was external smoke and portability, not a performance podium.

The traceability is now public in the repo: enterprise-runtime-lab. The canonical tag for the published state is runtime-lab-final. I also kept the editorial brief, the evidence map, and the Railway replication note.

For me this is the most important part of the process: the brief was not bureaucracy. It was the mechanism that kept the article from becoming a framework fight. The real story is not “Spring won.” The real story is “the conclusion changed when the benchmark stopped being convenient and started being defensible.”

Publishing notes and traceability

This post is backed by public evidence. The lab is versioned on GitHub, with canonical tag runtime-lab-final and final commit d176ed6. The phase tags preserve how the methodology changed: scaffold, baseline, realistic benchmark, causal analysis, fairness matrix, Railway smoke, and final.

My conclusion (debatable, but with numbers alongside)

If I were starting a new product today with a team that already knows Spring, I’d use Spring Boot. Not because "Jakarta EE isn’t good," but because the combination of local performance in this lab, memory, developer experience, documentation, integrations, and operation carries weight. If the organization already has Jakarta EE/Payara/GlassFish, I pause before proposing a rewrite: Payara Micro and Embedded GlassFish don’t win by default, but they deserve a serious trial with the real workload. The most important result is not "runtime X won"; it’s that migration decisions should be tested against the real workload, not against intuition or generic benchmarks.

I’ll close with an open question: if tomorrow you had to decide in your team, would you run your own benchmark first or bet on intuition? What would you do with your real workload and your delivery pressure?

