DEV Community

Ghofrane WECHCRIA
OrqueIO Benchmarking

A deep dive into load testing the open-source Camunda 7 fork with Gatling, Kubernetes, and Prometheus

When your business relies on a BPMN engine to orchestrate critical processes, performance is not just a nice-to-have. It is essential. Every millisecond of latency, every failed request, every resource bottleneck translates directly into business impact.

That is why we put OrqueIO through rigorous performance testing. The results speak for themselves.

What is OrqueIO?

OrqueIO is a production-ready, open-source fork of Camunda 7, the industry-standard BPMN 2.0 workflow engine trusted by enterprises worldwide. It was born from the need for long-term support and continuous improvement without costly migrations. OrqueIO preserves full compatibility with existing Camunda 7 processes while delivering enhanced security, stability, and performance.

For teams facing the end of Camunda 7 support or seeking a vendor-independent BPMN solution, OrqueIO offers a seamless path forward.

What is Gatling?

Before diving into the results, let me explain our testing tool. Gatling is an open-source load testing framework written in Scala. Unlike traditional tools that simulate users with threads (which gets expensive at scale), Gatling uses an asynchronous, non-blocking architecture. This means a single Gatling instance can simulate thousands of concurrent users without breaking a sweat.

Why we chose Gatling:

  • Code-based simulations: Tests are written in Java, making them version-controllable and easy to integrate into CI/CD pipelines
  • Realistic load patterns: You can ramp up users gradually, hold steady load, or create complex traffic patterns that mirror real-world usage
  • Beautiful reports: Gatling generates detailed HTML reports with response time distributions, percentiles, and throughput charts
  • Low resource footprint: Thanks to its async architecture, you can generate massive load from modest hardware
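Gatling itself is written in Scala with simulations in Java, but the async model behind that last point is easy to illustrate with a toy sketch (plain Python asyncio, not Gatling code): each virtual user is a cheap coroutine on a single event loop rather than an OS thread, which is why thousands of concurrent users fit on modest hardware.

```python
import asyncio
import time

async def virtual_user(user_id: int, send_request) -> float:
    """One virtual user: fire a single request and measure its latency."""
    start = time.perf_counter()
    await send_request(user_id)
    return time.perf_counter() - start

async def run_load(n_users: int, send_request) -> list[float]:
    """Launch n_users concurrent virtual users on one event loop."""
    tasks = [virtual_user(i, send_request) for i in range(n_users)]
    return await asyncio.gather(*tasks)

async def fake_backend(user_id: int) -> None:
    # Stand-in for an HTTP call; a real generator would await an HTTP client here.
    await asyncio.sleep(0.001)

if __name__ == "__main__":
    latencies = asyncio.run(run_load(1000, fake_backend))
    print(f"{len(latencies)} requests completed")
```

A thread-per-user tool would need 1000 OS threads (and their stacks) for the same run; here the only per-user cost is a coroutine object.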

Our test scenario is straightforward: we repeatedly call the OrqueIO REST API to start new process instances. This is the most common and performance-critical operation in any workflow engine.

POST /engine-rest/process-definition/key/{processKey}/start

Each request creates a new BPMN process instance, persists it to the database, and returns the instance ID. Simple, but it exercises the full stack.

The Testing Setup

We designed our load testing infrastructure to simulate real-world production conditions.

Kubernetes Environment:

  • Amazon EKS cluster
  • All components deployed in the same namespace for simplified networking
  • PostgreSQL database running in EKS for consistent latency

Monitoring Stack:

  • Prometheus for metrics collection
  • Grafana Alloy for scraping container metrics
  • Real-time CPU and memory tracking during test execution

Load Testing Pipeline:

  • Gatling packaged in a Docker container
  • Jenkins for orchestration and parameterized test runs
  • Automatic report upload to S3

The entire setup is automated. We can spin up a test, adjust pod counts, change resource limits, and compare results across configurations with just a few clicks.

The Testing Infrastructure

We built a fully automated testing pipeline that anyone can replicate. Here is how all the pieces fit together:

The Flow:

  1. Trigger: A developer kicks off a Jenkins build, either manually or on a schedule.

  2. Parameters: Jenkins prompts for test configuration. How many pod replicas? What CPU and memory limits? How long should the test run? How many users per second?

  3. Build: Jenkins builds the Gatling Docker image and pushes it to Amazon ECR.

  4. Scale: Before the test starts, Jenkins uses kubectl to scale the OrqueIO deployment to the desired replica count and applies the specified resource limits.

  5. Deploy: Jenkins deploys the Gatling container as a Kubernetes Job in the same namespace as OrqueIO.

  6. Execute: The Gatling pod hammers the OrqueIO REST API with HTTP requests. OrqueIO processes each request and persists data to PostgreSQL.

  7. Monitor: Meanwhile, Grafana Alloy scrapes CPU and memory metrics from all pods and pushes them to Prometheus.

  8. Collect: When the test finishes, the Gatling pod queries Prometheus for resource usage during the test window, bundles everything into a report, and uploads it to S3.
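Step 4 boils down to two kubectl invocations. A sketch of how a pipeline might shell out to them (the deployment and namespace names here are illustrative, not from our cluster):

```python
import subprocess

def scale_cmd(deployment: str, namespace: str, replicas: int) -> list[str]:
    """kubectl command to set the replica count of a deployment."""
    return ["kubectl", "-n", namespace, "scale",
            f"deployment/{deployment}", f"--replicas={replicas}"]

def resources_cmd(deployment: str, namespace: str,
                  cpu_limit: str, mem_limit: str) -> list[str]:
    """kubectl command to apply CPU/memory limits to the deployment's containers."""
    return ["kubectl", "-n", namespace, "set", "resources",
            f"deployment/{deployment}",
            f"--limits=cpu={cpu_limit},memory={mem_limit}"]

def apply_config(deployment: str, namespace: str, replicas: int,
                 cpu_limit: str, mem_limit: str) -> None:
    """Run both commands (requires kubectl and cluster access)."""
    subprocess.run(scale_cmd(deployment, namespace, replicas), check=True)
    subprocess.run(resources_cmd(deployment, namespace, cpu_limit, mem_limit), check=True)

# Example (needs a live cluster):
# apply_config("orqueio", "loadtest", 3, "2000m", "1Gi")
```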

Gatling in Docker: We packaged Gatling in a container along with scripts that automatically query Prometheus during test execution. When a test finishes, we capture not just the Gatling metrics but also the actual CPU and memory consumption of the target pods.

Jenkins Pipeline: Our Jenkins jobs accept parameters for pod count, CPU limits, memory limits, and test duration. This lets us quickly run comparison tests across different configurations without manual intervention.

Prometheus Integration: During each test, we collect container metrics in real time:

  • CPU usage via container_cpu_usage_seconds_total
  • Memory usage via container_memory_usage_bytes
  • Per-pod breakdowns for debugging hotspots
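The collection step can be sketched as a call to Prometheus' `query_range` HTTP API over the test window (the server URL and namespace label below are placeholders):

```python
import json
import urllib.parse
import urllib.request

def range_query_url(prom_url: str, query: str,
                    start: float, end: float, step: str = "15s") -> str:
    """Build a Prometheus /api/v1/query_range URL for the test window."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step})
    return f"{prom_url}/api/v1/query_range?{params}"

# Per-pod CPU rate and memory over the window (label selector is illustrative).
CPU_QUERY = 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="loadtest"}[1m]))'
MEM_QUERY = 'sum by (pod) (container_memory_usage_bytes{namespace="loadtest"})'

def fetch(url: str) -> dict:
    """GET a Prometheus API URL and decode the JSON envelope."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Example against a live Prometheus:
# data = fetch(range_query_url("http://prometheus:9090", CPU_QUERY, t0, t1))
# for series in data["data"]["result"]:
#     print(series["metric"]["pod"], series["values"][-1])
```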

Everything gets uploaded to S3 at the end of each run, so we have a historical record of every test.

Benchmarking

We ran three tests with different configurations to understand how OrqueIO behaves under various conditions:

Test 1: Single Pod Baseline
Configuration:

  • 1 Pod replica
  • CPU: 500m request / 2000m limit
  • Memory: 512Mi request / 1Gi limit
  • Load: 10 requests per second for 1 minute (600 requests total)

Results (Gatling report):

Resource Consumption:

  • CPU: 0.031 cores average
  • Memory: 323 MB stable

What this tells us:

The single pod handled all 600 requests with zero failures. The minimum response time of 37ms represents the true baseline: a request hitting the pod with no GC activity. That is impressive for a full cycle of BPMN instance creation, database persistence, and response.

The interesting story is in the tail latency. Look at the gap between P95 (72ms) and P99 (478ms): a 6x jump. This is the classic signature of JVM garbage collection. Most requests fly through quickly, but roughly 1 in 100 gets caught waiting for GC.

The resource numbers tell us something important: the pod used only 0.031 cores (6% of requested) and memory sat stable at 323 MB. This pod is not struggling with capacity. It is struggling with GC pauses that happen regardless of load.


Test 2: Horizontal Scaling with 3 Pods

Configuration:

  • 3 Pod replicas
  • CPU: 500m request / 2000m limit per pod
  • Memory: 512Mi request / 1Gi limit per pod
  • Load: 10 requests per second for 1 minute (600 requests total)

Results (Gatling report):

Resource Consumption:

  • CPU: 0.101 cores total (about 0.034 per pod)
  • Memory: 330 MB per pod

What this tells us:

We tripled the pods but kept the same load. The improvements go far beyond what simple load distribution would explain.

The minimum dropped from 37ms to 34ms: even the best case benefits from reduced contention. But the real story is tail latency: P99 went from 478ms to just 60ms, an 87% improvement. Maximum response time fell from 758ms to 116ms.

Why? When one pod pauses for GC, the load balancer routes requests to the other two pods. The probability that all three pause simultaneously is tiny.

Most importantly, standard deviation collapsed from 68ms to just 7ms. The system became predictable. Each pod used about 0.034 cores and 330 MB — nearly identical to the single pod test. We did not waste resources. We simply distributed the workload and let each JVM breathe.
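A back-of-the-envelope check on that argument (the 1% pause fraction is an assumption, chosen to roughly match the 1-in-100 P99 observation above):

```python
# Probability model: each pod is in a GC pause some fraction of the time,
# independently of the others (an assumption; real JVMs are not perfectly independent).
p_paused = 0.01          # ~1 in 100 requests hit a pause on a single pod

# Single pod: a request stalls whenever that pod is paused.
p_stall_1_pod = p_paused

# Three pods behind a load balancer that routes around paused pods:
# a request can only stall if all three are paused at once.
p_stall_3_pods = p_paused ** 3

print(f"1 pod:  {p_stall_1_pod:.2%} of requests stall")
print(f"3 pods: {p_stall_3_pods:.6%} of requests stall")
```

Under this model the stall probability drops from 1 in 100 to roughly 1 in a million, which is why the tail collapses so dramatically.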

Test 3: Sustained Load (2-Hour Endurance Test)

Configuration:

  • 1 Pod replica
  • CPU: 500m request / 2000m limit per pod
  • Memory: 512Mi request / 2Gi limit per pod
  • Load: 10 requests per second for 2 hours (72,000 requests total)

Results (Gatling report):

Response Time Distribution:

  • Under 800ms: 71,971 requests (99.96%)
  • 800ms to 1200ms: 28 requests (0.04%)
  • Over 1200ms: 1 request (0.001%)
  • Failed: 0 requests (0%)
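These counts are self-consistent with the load profile; a quick arithmetic check:

```python
# Counts from the 2-hour Gatling report above.
under_800, mid_band, over_1200, failed = 71_971, 28, 1, 0
total = under_800 + mid_band + over_1200 + failed

assert total == 72_000  # 10 req/s * 7200 s

print(f"over 800ms: {(mid_band + over_1200) / total:.2%}")  # the GC tail
print(f"failures:   {failed / total:.0%}")
```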

Resource Consumption:

  • CPU: 0.040 cores per pod average
  • Memory: 320 MB per pod average, 450 MB per pod peak

What this tells us:

This is the real test. Two hours of sustained load, 72,000 process instances created, and not a single failure. That is production-grade reliability.

The latency profile stayed remarkably consistent with the short test. Minimum response time actually improved to 32ms, and P99 held steady at 65ms. Even after 2 hours, 99.96% of requests completed in under 800ms.

The 1509ms maximum and the 29 requests over 800ms represent rare GC pauses — about 0.04% of traffic. For most applications, this is perfectly acceptable.

Memory tells an interesting story. Each pod averages around 320 MB. Under sustained load, the heap grows as request objects are created, peaking at around 450 MB per pod, then shrinks after garbage collection runs. This sawtooth pattern is exactly how a healthy Java application behaves. No leaks, no OOM kills, no memory pressure. The 2Gi limit gave plenty of headroom for GC to work efficiently.


Key Performance Insights

1. GC Pauses are Solvable Through Architecture

The gap between P95 and P99 in single-pod deployments reveals the cost of garbage collection. But you do not need exotic GC tuning. Running multiple pods lets the load balancer route around pauses naturally.

2. Horizontal Scaling Pays Compound Dividends

Tripling pods gave us more than 3x benefit in tail latency. You get fewer pauses (less load per JVM) AND better handling of remaining pauses (redundancy). These benefits compound.

3. Memory is Stable and Predictable

All tests showed consistent memory usage around 320–330 MB per pod, peaking at 450 MB under load. No memory leaks, no accumulating state. A 512Mi request with 2Gi limit is appropriately sized.

4. CPU is Not the Bottleneck

At 0.03 cores per pod, CPU utilization is negligible. This is typical for IO-bound applications. You can pack OrqueIO pods densely on your nodes.

5. The Database Keeps Up

With response times in the 40–50ms range including database round trips, PostgreSQL has plenty of headroom. Our PostgreSQL pod (500m CPU, 512Mi memory) handled 72,000 inserts over 2 hours without breaking a sweat. The latency spikes come from JVM pauses, not database contention.


Conclusion

OrqueIO proves that open-source BPMN engines can deliver enterprise-grade performance. With sub-50ms median latency, linear scaling characteristics, and predictable resource consumption, it is a compelling choice for organizations seeking a sustainable, high-performance workflow automation platform.

The transition from Camunda 7 to OrqueIO is not just about maintaining compatibility. It is about building on a solid foundation with a clear path forward.


Have you tested OrqueIO in your environment? We would love to hear about your experience. Drop a comment below or reach out on GitHub.
