JNBridge

Posted on • Originally published at jnbridge.com
Squeezing Microseconds: A Practical Guide to Java/.NET Performance Tuning

If you're calling Java from .NET (or vice versa), you've probably noticed that cross-runtime calls aren't free. The bridge overhead itself is tiny — microseconds per call — but when you're making thousands of calls per request, those microseconds stack up fast.

I've spent a lot of time profiling cross-runtime performance in production systems, and the surprising truth is: the bridge is almost never the bottleneck. GC pauses, chatty call patterns, and object marshaling eat way more time. Here's everything I've learned about making Java/.NET integration fast.

Where Latency Actually Hides

Most teams blame the bridge. In practice, here's the real breakdown:

| Source | Typical Latency | How to Detect |
|---|---|---|
| Bridge call overhead | 1–50µs | Microbenchmark isolated calls |
| Object marshaling/serialization | 10–500µs | Profile with complex objects vs primitives |
| GC pauses (either runtime) | 1–200ms | GC logs (both JVM and CLR) |
| JVM cold start (first call) | 1–5s | Measure first call vs subsequent |
| Class loading (Java) | 10–100ms | Profile with `-verbose:class` |
| JIT compilation (both runtimes) | 50–500ms first execution | Warmup timing, tiered compilation logs |
| Thread contention at bridge | Variable | Thread dump analysis, lock profiling |
| Network latency (TCP mode) | 0.1–1ms per call | Switch to shared memory, compare |

Rule of thumb: If your cross-runtime calls are slower than expected, look at GC, class loading, and call patterns first — not the bridge mechanism.

Measure Before You Optimize

Performance tuning without measurement is guessing. Establish baselines first:

  1. Single call latency — One method call with a primitive parameter. This is your overhead floor.
  2. Complex call latency — Same call with realistic objects (lists, custom classes). Difference = marshaling cost.
  3. Throughput — Max calls/sec before latency degrades. Tests concurrency limits.
  4. P99 latency — The 99th percentile matters more than average. GC pauses cause tail spikes.
  5. Cold start time — First call after JVM init. Worst-case latency.
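A plain `System.nanoTime()` harness is enough to establish these baselines before reaching for a full benchmarking framework. A minimal Java sketch (the `simpleCall` method below is a stand-in for a real proxy call, not a bridge API):

```java
import java.util.Arrays;

public class BaselineHarness {
    // Stand-in for a real cross-runtime proxy call (hypothetical)
    static int simpleCall(int x) { return x + 1; }

    // Times an operation n times and returns the sorted latencies in ns
    static long[] measure(int n, Runnable op) {
        long[] samples = new long[n];
        for (int i = 0; i < n; i++) {
            long t0 = System.nanoTime();
            op.run();
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        return samples;
    }

    public static void main(String[] args) {
        // Warmup so JIT compilation does not pollute the numbers
        for (int i = 0; i < 10_000; i++) simpleCall(i);

        long[] s = measure(10_000, () -> simpleCall(42));
        long p50 = s[s.length / 2];
        long p99 = s[(int) (s.length * 0.99)];
        System.out.println("p50=" + p50 + "ns p99=" + p99 + "ns");
    }
}
```

Run the same harness with a complex-object call in place of `simpleCall`; the difference between the two medians is your marshaling cost.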

Benchmarking Template (C#)

```csharp
// BenchmarkDotNet setup for cross-runtime calls
[MemoryDiagnoser]
[GcServer(true)]
public class BridgeCallBenchmarks
{
    private JavaProxy _proxy;
    private List<string> _testData;
    private TradeRequest _sampleTrade;

    [GlobalSetup]
    public void Setup()
    {
        _proxy = new JavaProxy();
        _testData = Enumerable.Range(0, 100).Select(i => $"item-{i}").ToList();
        _sampleTrade = new TradeRequest("MSFT", 100m, 412.50m, "BUY");

        // Warmup: 1000 calls to trigger JIT on both sides
        for (int i = 0; i < 1000; i++)
            _proxy.Add(i, i);
    }

    [Benchmark(Baseline = true)]
    public int SimpleCall() => _proxy.Add(42, 58);

    [Benchmark]
    public List<string> ComplexCall() => _proxy.ProcessList(_testData);

    [Benchmark]
    public TradeResult RealWorldCall() => _proxy.ExecuteTrade(_sampleTrade);
}
```

JVM Tuning for Bridge Workloads

Heap Sizing

When the JVM runs inside (or alongside) a .NET process, memory is shared. Set explicit bounds:

```
# Recommended JVM flags for bridge workloads
-Xms512m                        # Initial heap (avoid resize delays)
-Xmx1g                          # Maximum heap (leave room for CLR)
-XX:MaxMetaspaceSize=256m       # Cap class metadata
-XX:ReservedCodeCacheSize=128m  # JIT compiled code cache
```

Critical rule: Total JVM heap + CLR managed heap + native overhead must fit in available RAM. In a 4GB container: budget ~1GB for JVM, ~1.5GB for CLR, ~1.5GB for OS and native allocations.

GC Selection

| GC Algorithm | Best For | Bridge Impact |
|---|---|---|
| G1GC (Java 9+ default) | General workloads, 1–16GB heap | Good default. 10–50ms pause target. |
| ZGC | Ultra-low latency, large heaps | Sub-millisecond pauses. Best for latency-sensitive bridges. |
| Shenandoah | Low latency, Red Hat/OpenJDK | Similar to ZGC. Available in OpenJDK builds. |
| Serial GC | Small heaps (<256MB) | Stop-the-world but fast for tiny heaps. |

```
# For low-latency bridge workloads (Java 17+)
-XX:+UseZGC
-XX:SoftMaxHeapSize=768m    # ZGC returns memory below this
-XX:ZCollectionInterval=5   # Proactive GC every 5 seconds

# For general bridge workloads
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20     # Target 20ms max pause
-XX:G1HeapRegionSize=4m     # Optimize for your object sizes
```

JIT Compiler Optimization

```
# Tiered compilation (the default since Java 8; keep it on)
-XX:+TieredCompilation
# Compile hot methods sooner (default: 10000; note this threshold
# is only honored when tiered compilation is disabled)
-XX:CompileThreshold=100
# For faster warmup at the cost of peak performance:
-XX:TieredStopAtLevel=1     # Stop at C1, skip the C2 compiler
```

CLR and .NET Runtime Tuning

Server GC vs Workstation GC

For bridge workloads, always use Server GC:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.Concurrent": true,
      "System.GC.HeapHardLimit": 1610612736
    }
  }
}
```

Why: Workstation GC does most collection work on a single thread, so pauses grow longer under concurrent load. Server GC dedicates a GC thread and heap per logical core, which keeps individual pauses shorter. For concurrent bridge calls, Server GC reduces tail latency significantly.

.NET 9 DATAS GC

Dynamic Adaptation to Application Sizes (DATAS), introduced in .NET 8 and enabled by default in .NET 9, auto-adjusts heap size based on workload, meaning the CLR won't over-allocate when the JVM also needs heap space:

```json
{
  "configProperties": {
    "System.GC.DynamicAdaptationMode": 1
  }
}
```

Thread Pool Tuning

```csharp
// Set minimum threads to avoid pool starvation during bridge calls
ThreadPool.SetMinThreads(
    workerThreads: Environment.ProcessorCount * 2,
    completionPortThreads: Environment.ProcessorCount);
```

Garbage Collection Coordination

The biggest performance killer: GC pauses in one runtime stalling the other.

When the JVM is in a stop-the-world GC pause, .NET threads waiting for bridge responses are blocked. If the CLR triggers its own GC at the same time, the two pauses compound.

Mitigation Strategies

  1. Use low-pause GCs on both sides — ZGC (Java) + Server GC (.NET) keeps pauses under 1ms
  2. Stagger GC timing — Proactive JVM GC during idle periods (-XX:ZCollectionInterval=5)
  3. Monitor both GC logs simultaneously — Correlate JVM GC events with .NET events to find compounding pauses
  4. Reduce object allocation at the bridge boundary — Reuse objects, use value types, avoid unnecessary boxing
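Item 4 is easy to get wrong: boxed collections allocate one wrapper object per element at the boundary, while primitive arrays are bulk-copied. A minimal Java sketch of the difference (the method names are illustrative, not a bridge API):

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryAllocation {
    // Reused buffer: filled in place on each call instead of reallocating
    private final double[] priceBuffer = new double[1024];

    // BAD: every call allocates a List plus one boxed Double per element
    static double sumBoxed(List<Double> prices) {
        double total = 0;
        for (Double p : prices) total += p;  // unboxing on every element
        return total;
    }

    // GOOD: primitive array, bulk-copied across the bridge, zero boxing
    static double sumPrimitive(double[] prices, int count) {
        double total = 0;
        for (int i = 0; i < count; i++) total += prices[i];
        return total;
    }

    public static void main(String[] args) {
        double[] vals = {1.5, 2.5, 3.0};
        List<Double> boxed = new ArrayList<>();
        for (double v : vals) boxed.add(v);
        System.out.println(sumBoxed(boxed) == sumPrimitive(vals, 3)); // true
    }
}
```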

Enabling GC Logs for Both Runtimes

```
# JVM GC logging
-Xlog:gc*:file=jvm-gc.log:time,uptime,level,tags:filecount=5,filesize=10m

# .NET GC events (via EventPipe: watch the System.Runtime GC counters)
dotnet-counters monitor --process-id <pid> --counters System.Runtime
```

Optimizing Cross-Runtime Call Patterns

Anti-Pattern: Chatty Calls

```csharp
// BAD: 1000 individual bridge calls
for (int i = 0; i < orders.Count; i++)
{
    var result = javaService.ValidateOrder(orders[i]);  // ~10µs each
    // 1000 * 10µs = 10ms overhead
}
```

Pattern: Batch Calls

```csharp
// GOOD: 1 bridge call with batch data
var results = javaService.ValidateOrders(orders);  // ~50µs total
// 50µs vs 10ms = 200x faster
```

Rule: Every cross-runtime call has fixed overhead. Minimize the number of calls, not the data per call. One call with 1000 items beats 1000 calls with 1 item.

Pattern: Coarse-Grained Interfaces

```csharp
// BAD: Fine-grained Java API from .NET
var customer = javaProxy.GetCustomer(id);
var address = javaProxy.GetAddress(customer.AddressId);
var orders = javaProxy.GetOrders(customer.Id);
var total = javaProxy.CalculateTotal(orders);
// 4 bridge calls

// GOOD: Coarse-grained facade
var summary = javaProxy.GetCustomerSummary(id);
// 1 bridge call — Java handles the joins internally
```

Design principle: Create coarse-grained Java facades that batch operations per bridge call. Let Java-to-Java calls happen inside the JVM (zero overhead), and only cross the bridge for the final result.
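On the Java side, such a facade can be a single method that does the lookups internally and returns one flat DTO. A sketch with hypothetical names (`CustomerFacade`, `CustomerSummary`, and the loader methods are illustrative placeholders):

```java
import java.util.List;

// Hypothetical coarse-grained facade, exposed across the bridge as ONE call
public class CustomerFacade {
    // Flat DTO returned across the bridge in a single marshaling pass
    public record CustomerSummary(long id, String name, String city,
                                  int orderCount, double orderTotal) {}

    record Order(double amount) {}

    // These lookups stay inside the JVM (zero bridge overhead); in a real
    // system they would hit a database or service
    Object[] loadCustomer(long id) { return new Object[]{"Ada", "London"}; }
    List<Order> loadOrders(long id) { return List.of(new Order(10.0), new Order(32.5)); }

    // The only method .NET calls: one bridge crossing, joins done here
    public CustomerSummary getCustomerSummary(long id) {
        Object[] c = loadCustomer(id);
        List<Order> orders = loadOrders(id);
        double total = orders.stream().mapToDouble(Order::amount).sum();
        return new CustomerSummary(id, (String) c[0], (String) c[1],
                                   orders.size(), total);
    }

    public static void main(String[] args) {
        CustomerSummary s = new CustomerFacade().getCustomerSummary(42);
        System.out.println(s.orderCount() + " orders, total " + s.orderTotal());
    }
}
```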

Pattern: Async Fire-and-Forget

```csharp
// For non-blocking operations (logging, analytics, cache warming)
// Note: exceptions inside the task are swallowed unless observed
Task.Run(() => javaProxy.LogAnalyticsEvent(eventData));
// Don't await; .NET continues immediately
```

Object Marshaling Optimization

| Data Type | Marshaling Cost | Optimization |
|---|---|---|
| Primitives (int, double, bool) | Negligible | Use directly |
| Strings | Low (UTF-16 both sides) | Avoid unnecessary conversions |
| Arrays of primitives | Low (bulk copy) | Prefer over `List<T>` |
| Simple objects (few fields) | Low-Medium | Use DTOs, not full entities |
| Collections (List, Map) | Medium (element-by-element) | Use arrays when possible |
| Deep object graphs | High | Flatten or use DTOs |
| Exceptions | High (stack trace construction) | Use error codes for expected failures |
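For the last row, expected failures can travel as a status code in the return value rather than a marshaled exception. A sketch of the pattern (the `ValidationResult` type and its codes are illustrative, not part of any bridge API):

```java
// Hypothetical result type: expected failures travel as a cheap status
// code instead of an exception with a marshaled stack trace
public class ValidationResult {
    public static final int OK = 0;
    public static final int INSUFFICIENT_FUNDS = 1;

    public final int code;
    public final String detail;

    private ValidationResult(int code, String detail) {
        this.code = code;
        this.detail = detail;
    }

    public static ValidationResult ok() { return new ValidationResult(OK, ""); }
    public static ValidationResult fail(int code, String detail) {
        return new ValidationResult(code, detail);
    }

    // Expected failure: return a code, don't throw across the bridge
    public static ValidationResult validate(double balance, double amount) {
        if (amount > balance)
            return fail(INSUFFICIENT_FUNDS, "balance too low");
        return ok();
    }
}
```

Reserve actual exceptions for genuinely unexpected conditions, where the cost of building and marshaling a stack trace is worth paying.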

DTO Pattern for Cross-Runtime Data

```csharp
// .NET DTO: flat, minimal fields
public record TradeRequest(
    string Symbol,
    decimal Quantity,
    decimal Price,
    string Side  // "BUY" or "SELL"
);
```

```java
// Java DTO: mirrors the .NET structure
public record TradeRequest(
    String symbol,
    BigDecimal quantity,
    BigDecimal price,
    String side
) {}
```

Key optimizations:

  • Keep DTOs flat (no nested objects when avoidable)
  • Use primitive types and strings over complex objects
  • Avoid passing Java-specific types (HashMap internals, Stream objects) across the bridge
  • For large datasets: pass byte arrays and deserialize on the receiving side
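The byte-array approach from the last bullet can be as simple as `DataOutputStream` framing on the sending side and a matching `DataInputStream` reader on the receiving side. An illustrative Java sketch (the pack/read methods are hypothetical helpers, not a bridge API):

```java
import java.io.*;

public class ByteArrayTransport {
    // Pack a batch of (symbol, price) pairs into one byte[] so a single
    // bridge call carries the whole dataset
    static byte[] pack(String[] symbols, double[] prices) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(symbols.length);
        for (int i = 0; i < symbols.length; i++) {
            out.writeUTF(symbols[i]);
            out.writeDouble(prices[i]);
        }
        return bytes.toByteArray();
    }

    // Receiving side: deserialize inside its own runtime
    static double sumPrices(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        int n = in.readInt();
        double total = 0;
        for (int i = 0; i < n; i++) {
            in.readUTF();             // skip symbol
            total += in.readDouble();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] p = pack(new String[]{"MSFT", "AAPL"}, new double[]{1.0, 2.5});
        System.out.println(sumPrices(p)); // prints 3.5
    }
}
```

Any compact binary format (Protocol Buffers, MessagePack, or hand-rolled framing like this) works; the point is one marshaled array instead of thousands of marshaled objects.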

Connection and Resource Pooling

JVM Instance Reuse

Never create multiple JVM instances per request:

```csharp
// Singleton pattern for bridge initialization
public sealed class JavaBridge
{
    private static readonly Lazy<JavaBridge> _instance =
        new(() => new JavaBridge());

    public static JavaBridge Instance => _instance.Value;

    private JavaBridge()
    {
        JNBridge.Initialize(); // One-time (1-3 seconds)
    }
}
```

Object Pooling for Frequently Used Java Objects

```csharp
private readonly ObjectPool<JavaPdfParser> _parserPool =
    new DefaultObjectPool<JavaPdfParser>(
        new JavaPdfParserPoolPolicy(), maximumRetained: 10);

public byte[] ConvertPdf(byte[] input)
{
    var parser = _parserPool.Get();
    try { return parser.Convert(input); }
    finally { _parserPool.Return(parser); }
}
```

Profiling Tools

| Tool | Runtime | Best For | Free? |
|---|---|---|---|
| BenchmarkDotNet | .NET | Microbenchmarks, memory allocation | Yes |
| dotnet-trace / dotnet-counters | .NET | Runtime diagnostics, GC events | Yes |
| JDK Flight Recorder (JFR) | Java | Low-overhead production profiling | Yes |
| async-profiler | Java | CPU + allocation profiling, flame graphs | Yes |
| VisualVM | Java | Heap analysis, thread monitoring | Yes |
| OpenTelemetry | Both | Distributed tracing across runtimes | Yes |
| Prometheus + Grafana | Both | Metrics dashboards, alerting | Yes |

Recommended Workflow

  1. Start with OpenTelemetry tracing — instrument bridge calls with spans
  2. Enable GC logging on both runtimes — check for correlated pauses
  3. Run BenchmarkDotNet microbenchmarks — isolate bridge overhead
  4. Use JFR in production — low overhead (<2%) continuous profiling
  5. Build a Grafana dashboard — track P50/P95/P99 latency over time

Benchmarks: Before and After

| Scenario | Before | After | Improvement | Technique |
|---|---|---|---|---|
| 1000 individual calls | 10ms | 0.05ms | 200x | Batch call pattern |
| Complex object marshaling | 500µs | 50µs | 10x | DTO flattening |
| P99 latency (GC spikes) | 200ms | 2ms | 100x | ZGC + Server GC |
| Cold start (first call) | 5s | 1.5s | 3.3x | Eager class loading + tiered compilation |
| Concurrent throughput | 5K calls/s | 50K calls/s | 10x | Thread pool tuning + object pooling |
| TCP mode overhead | 0.5ms/call | 5µs/call | 100x | Switch to shared memory mode |

Most impactful: Switching from chatty calls to batch calls. Almost always the biggest win.

FAQ

What's the typical overhead of a JNBridgePro bridge call?
A single call with simple parameters takes 1–50µs in shared memory mode. For comparison, a REST API call to the same method on localhost: 5–50ms — 1000x slower.

Shared memory or TCP mode?
Use shared memory when Java and .NET run on the same machine. It eliminates network latency entirely (5µs vs 0.5ms per call). TCP mode is only for when JVM and CLR are on different machines.

How do I prevent JVM GC from blocking .NET?
Use ZGC (Java 17+) or Shenandoah for sub-millisecond pauses. On .NET, enable Server GC with concurrent mode. Monitor both GC logs.

Can I make bridge calls async?
Bridge calls are synchronous by design (direct method invocation). Wrap in Task.Run() for fire-and-forget, or use a producer-consumer queue where a background thread makes bridge calls.
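The producer-consumer variant can be sketched in a few lines of Java: callers enqueue work and a single background thread owns the synchronous bridge calls (the `Runnable` payloads here stand in for real proxy invocations):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: callers enqueue work; one background thread drains the queue and
// makes the (synchronous) bridge calls, so request threads never block
public class BridgeCallQueue implements AutoCloseable {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    public BridgeCallQueue() {
        worker = new Thread(() -> {
            try {
                while (true) queue.take().run(); // synchronous bridge call here
            } catch (InterruptedException e) {
                // shutdown requested; exit the loop
            }
        }, "bridge-worker");
        worker.setDaemon(true);
        worker.start();
    }

    // Fire-and-forget from the caller's perspective
    public void submit(Runnable bridgeCall) {
        queue.add(bridgeCall);
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

A single worker also serializes access to the bridge, which can help when the Java side isn't thread-safe; scale to a small worker pool if throughput demands it.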

How many concurrent calls can it handle?
No hard limit. With proper thread pool tuning, production systems handle 50,000+ calls/sec. The bottleneck is almost always business logic, not bridge overhead.

