JNBridge

Posted on • Originally published at jnbridge.com
Squeezing Microseconds: A Practical Guide to Java/.NET Performance Tuning

If you're calling Java from .NET (or vice versa), you've probably noticed that cross-runtime calls aren't free. The bridge overhead itself is tiny — microseconds per call — but when you're making thousands of calls per request, those microseconds stack up fast.

I've spent a lot of time profiling cross-runtime performance in production systems, and the surprising truth is: the bridge is almost never the bottleneck. GC pauses, chatty call patterns, and object marshaling eat way more time. Here's everything I've learned about making Java/.NET integration fast.

Where Latency Actually Hides

Most teams blame the bridge. In practice, here's the real breakdown:

| Source | Typical Latency | How to Detect |
|---|---|---|
| Bridge call overhead | 1–50µs | Microbenchmark isolated calls |
| Object marshaling/serialization | 10–500µs | Profile with complex objects vs primitives |
| GC pauses (either runtime) | 1–200ms | GC logs (both JVM and CLR) |
| JVM cold start (first call) | 1–5s | Measure first call vs subsequent |
| Class loading (Java) | 10–100ms | Profile with `-verbose:class` |
| JIT compilation (both runtimes) | 50–500ms first execution | Warmup timing, tiered compilation logs |
| Thread contention at bridge | Variable | Thread dump analysis, lock profiling |
| Network latency (TCP mode) | 0.1–1ms per call | Switch to shared memory, compare |

Rule of thumb: If your cross-runtime calls are slower than expected, look at GC, class loading, and call patterns first — not the bridge mechanism.

Measure Before You Optimize

Performance tuning without measurement is guessing. Establish baselines first:

  1. Single call latency — One method call with a primitive parameter. This is your overhead floor.
  2. Complex call latency — Same call with realistic objects (lists, custom classes). Difference = marshaling cost.
  3. Throughput — Max calls/sec before latency degrades. Tests concurrency limits.
  4. P99 latency — The 99th percentile matters more than average. GC pauses cause tail spikes.
  5. Cold start time — First call after JVM init. Worst-case latency.
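A plain `System.nanoTime()` harness is enough to establish these baselines before reaching for a full benchmarking framework. A minimal Java sketch (the `simpleCall` method below is a stand-in for a real proxy call, not a bridge API):

```java
import java.util.Arrays;

public class BaselineHarness {
    // Stand-in for a real cross-runtime proxy call (hypothetical)
    static int simpleCall(int x) { return x + 1; }

    // Times an operation n times and returns the sorted latencies in ns
    static long[] measure(int n, Runnable op) {
        long[] samples = new long[n];
        for (int i = 0; i < n; i++) {
            long t0 = System.nanoTime();
            op.run();
            samples[i] = System.nanoTime() - t0;
        }
        Arrays.sort(samples);
        return samples;
    }

    public static void main(String[] args) {
        // Warmup so JIT compilation does not pollute the numbers
        for (int i = 0; i < 10_000; i++) simpleCall(i);

        long[] s = measure(10_000, () -> simpleCall(42));
        long p50 = s[s.length / 2];
        long p99 = s[(int) (s.length * 0.99)];
        System.out.println("p50=" + p50 + "ns p99=" + p99 + "ns");
    }
}
```

Run the same harness with a complex-object call in place of `simpleCall`; the difference between the two medians is your marshaling cost.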

Benchmarking Template (C#)

```csharp
// BenchmarkDotNet setup for cross-runtime calls
[MemoryDiagnoser]
[GcServer(true)]
public class BridgeCallBenchmarks
{
    private JavaProxy _proxy;
    private List<string> _testData;
    private TradeRequest _sampleTrade;

    [GlobalSetup]
    public void Setup()
    {
        _proxy = new JavaProxy();
        _testData = Enumerable.Range(0, 100).Select(i => $"item-{i}").ToList();
        _sampleTrade = new TradeRequest("MSFT", 100m, 412.50m, "BUY");

        // Warmup: 1000 calls to trigger JIT on both sides
        for (int i = 0; i < 1000; i++)
            _proxy.Add(i, i);
    }

    [Benchmark(Baseline = true)]
    public int SimpleCall() => _proxy.Add(42, 58);

    [Benchmark]
    public List<string> ComplexCall() => _proxy.ProcessList(_testData);

    [Benchmark]
    public TradeResult RealWorldCall() => _proxy.ExecuteTrade(_sampleTrade);
}
```

JVM Tuning for Bridge Workloads

Heap Sizing

When the JVM runs inside (or alongside) a .NET process, memory is shared. Set explicit bounds:

```
# Recommended JVM flags for bridge workloads
-Xms512m                        # Initial heap (avoid resize delays)
-Xmx1g                          # Maximum heap (leave room for CLR)
-XX:MaxMetaspaceSize=256m       # Cap class metadata
-XX:ReservedCodeCacheSize=128m  # JIT compiled code cache
```

Critical rule: Total JVM heap + CLR managed heap + native overhead must fit in available RAM. In a 4GB container: budget ~1GB for JVM, ~1.5GB for CLR, ~1.5GB for OS and native allocations.

GC Selection

| GC Algorithm | Best For | Bridge Impact |
|---|---|---|
| G1GC (Java 9+ default) | General workloads, 1–16GB heap | Good default. 10–50ms pause target. |
| ZGC | Ultra-low latency, large heaps | Sub-millisecond pauses. Best for latency-sensitive bridges. |
| Shenandoah | Low latency, Red Hat/OpenJDK | Similar to ZGC. Available in OpenJDK builds. |
| Serial GC | Small heaps (<256MB) | Stop-the-world but fast for tiny heaps. |

```
# For low-latency bridge workloads (Java 17+)
-XX:+UseZGC
-XX:SoftMaxHeapSize=768m    # ZGC returns memory below this
-XX:ZCollectionInterval=5   # Proactive GC every 5 seconds

# For general bridge workloads
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20     # Target 20ms max pause
-XX:G1HeapRegionSize=4m     # Optimize for your object sizes
```

JIT Compiler Optimization

```
# Tiered compilation (the default since Java 8; keep it on)
-XX:+TieredCompilation
# Compile hot methods sooner (default: 10000; note this threshold
# is only honored when tiered compilation is disabled)
-XX:CompileThreshold=100
# For faster warmup at the cost of peak performance:
-XX:TieredStopAtLevel=1     # Stop at C1, skip the C2 compiler
```

CLR and .NET Runtime Tuning

Server GC vs Workstation GC

For bridge workloads, always use Server GC:

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.Concurrent": true,
      "System.GC.HeapHardLimit": 1610612736
    }
  }
}
```

Why: Workstation GC does most collection work on a single thread, so pauses grow longer under concurrent load. Server GC dedicates a GC thread and heap per logical core, which keeps individual pauses shorter. For concurrent bridge calls, Server GC reduces tail latency significantly.

.NET 9 DATAS GC

Dynamic Adaptation to Application Sizes (DATAS), introduced in .NET 8 and enabled by default in .NET 9, auto-adjusts heap size based on workload, meaning the CLR won't over-allocate when the JVM also needs heap space:

```json
{
  "configProperties": {
    "System.GC.DynamicAdaptationMode": 1
  }
}
```

Thread Pool Tuning

```csharp
// Set minimum threads to avoid pool starvation during bridge calls
ThreadPool.SetMinThreads(
    workerThreads: Environment.ProcessorCount * 2,
    completionPortThreads: Environment.ProcessorCount);
```

Garbage Collection Coordination

The biggest performance killer: GC pauses in one runtime stalling the other.

When the JVM is in a stop-the-world GC pause, .NET threads waiting for bridge responses are blocked. If the CLR triggers its own GC at the same time, the two pauses compound.

Mitigation Strategies

  1. Use low-pause GCs on both sides — ZGC (Java) + Server GC (.NET) keeps pauses under 1ms
  2. Stagger GC timing — Proactive JVM GC during idle periods (-XX:ZCollectionInterval=5)
  3. Monitor both GC logs simultaneously — Correlate JVM GC events with .NET events to find compounding pauses
  4. Reduce object allocation at the bridge boundary — Reuse objects, use value types, avoid unnecessary boxing
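Item 4 is easy to get wrong: boxed collections allocate one wrapper object per element at the boundary, while primitive arrays are bulk-copied. A minimal Java sketch of the difference (the method names are illustrative, not a bridge API):

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryAllocation {
    // Reused buffer: filled in place on each call instead of reallocating
    private final double[] priceBuffer = new double[1024];

    // BAD: every call allocates a List plus one boxed Double per element
    static double sumBoxed(List<Double> prices) {
        double total = 0;
        for (Double p : prices) total += p;  // unboxing on every element
        return total;
    }

    // GOOD: primitive array, bulk-copied across the bridge, zero boxing
    static double sumPrimitive(double[] prices, int count) {
        double total = 0;
        for (int i = 0; i < count; i++) total += prices[i];
        return total;
    }

    public static void main(String[] args) {
        double[] vals = {1.5, 2.5, 3.0};
        List<Double> boxed = new ArrayList<>();
        for (double v : vals) boxed.add(v);
        System.out.println(sumBoxed(boxed) == sumPrimitive(vals, 3)); // true
    }
}
```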

Enabling GC Logs for Both Runtimes

```
# JVM GC logging
-Xlog:gc*:file=jvm-gc.log:time,uptime,level,tags:filecount=5,filesize=10m

# .NET GC events (via EventPipe: watch the System.Runtime GC counters)
dotnet-counters monitor --process-id <pid> --counters System.Runtime
```

Optimizing Cross-Runtime Call Patterns

Anti-Pattern: Chatty Calls

```csharp
// BAD: 1000 individual bridge calls
for (int i = 0; i < orders.Count; i++)
{
    var result = javaService.ValidateOrder(orders[i]);  // ~10µs each
    // 1000 * 10µs = 10ms overhead
}
```

Pattern: Batch Calls

```csharp
// GOOD: 1 bridge call with batch data
var results = javaService.ValidateOrders(orders);  // ~50µs total
// 50µs vs 10ms = 200x faster
```

Rule: Every cross-runtime call has fixed overhead. Minimize the number of calls, not the data per call. One call with 1000 items beats 1000 calls with 1 item.

Pattern: Coarse-Grained Interfaces

```csharp
// BAD: Fine-grained Java API from .NET
var customer = javaProxy.GetCustomer(id);
var address = javaProxy.GetAddress(customer.AddressId);
var orders = javaProxy.GetOrders(customer.Id);
var total = javaProxy.CalculateTotal(orders);
// 4 bridge calls

// GOOD: Coarse-grained facade
var summary = javaProxy.GetCustomerSummary(id);
// 1 bridge call — Java handles the joins internally
```

Design principle: Create coarse-grained Java facades that batch operations per bridge call. Let Java-to-Java calls happen inside the JVM (zero overhead), and only cross the bridge for the final result.
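On the Java side, such a facade can be a single method that does the lookups internally and returns one flat DTO. A sketch with hypothetical names (`CustomerFacade`, `CustomerSummary`, and the loader methods are illustrative placeholders):

```java
import java.util.List;

// Hypothetical coarse-grained facade, exposed across the bridge as ONE call
public class CustomerFacade {
    // Flat DTO returned across the bridge in a single marshaling pass
    public record CustomerSummary(long id, String name, String city,
                                  int orderCount, double orderTotal) {}

    record Order(double amount) {}

    // These lookups stay inside the JVM (zero bridge overhead); in a real
    // system they would hit a database or service
    Object[] loadCustomer(long id) { return new Object[]{"Ada", "London"}; }
    List<Order> loadOrders(long id) { return List.of(new Order(10.0), new Order(32.5)); }

    // The only method .NET calls: one bridge crossing, joins done here
    public CustomerSummary getCustomerSummary(long id) {
        Object[] c = loadCustomer(id);
        List<Order> orders = loadOrders(id);
        double total = orders.stream().mapToDouble(Order::amount).sum();
        return new CustomerSummary(id, (String) c[0], (String) c[1],
                                   orders.size(), total);
    }

    public static void main(String[] args) {
        CustomerSummary s = new CustomerFacade().getCustomerSummary(42);
        System.out.println(s.orderCount() + " orders, total " + s.orderTotal());
    }
}
```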

Pattern: Async Fire-and-Forget

```csharp
// For non-blocking operations (logging, analytics, cache warming)
// Note: exceptions inside the task are swallowed unless observed
Task.Run(() => javaProxy.LogAnalyticsEvent(eventData));
// Don't await; .NET continues immediately
```

Object Marshaling Optimization

| Data Type | Marshaling Cost | Optimization |
|---|---|---|
| Primitives (int, double, bool) | Negligible | Use directly |
| Strings | Low (UTF-16 both sides) | Avoid unnecessary conversions |
| Arrays of primitives | Low (bulk copy) | Prefer over `List<T>` |
| Simple objects (few fields) | Low-Medium | Use DTOs, not full entities |
| Collections (List, Map) | Medium (element-by-element) | Use arrays when possible |
| Deep object graphs | High | Flatten or use DTOs |
| Exceptions | High (stack trace construction) | Use error codes for expected failures |
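For the last row, expected failures can travel as a status code in the return value rather than a marshaled exception. A sketch of the pattern (the `ValidationResult` type and its codes are illustrative, not part of any bridge API):

```java
// Hypothetical result type: expected failures travel as a cheap status
// code instead of an exception with a marshaled stack trace
public class ValidationResult {
    public static final int OK = 0;
    public static final int INSUFFICIENT_FUNDS = 1;

    public final int code;
    public final String detail;

    private ValidationResult(int code, String detail) {
        this.code = code;
        this.detail = detail;
    }

    public static ValidationResult ok() { return new ValidationResult(OK, ""); }
    public static ValidationResult fail(int code, String detail) {
        return new ValidationResult(code, detail);
    }

    // Expected failure: return a code, don't throw across the bridge
    public static ValidationResult validate(double balance, double amount) {
        if (amount > balance)
            return fail(INSUFFICIENT_FUNDS, "balance too low");
        return ok();
    }
}
```

Reserve actual exceptions for genuinely unexpected conditions, where the cost of building and marshaling a stack trace is worth paying.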

DTO Pattern for Cross-Runtime Data

```csharp
// .NET DTO: flat, minimal fields
public record TradeRequest(
    string Symbol,
    decimal Quantity,
    decimal Price,
    string Side  // "BUY" or "SELL"
);
```

```java
// Java DTO: mirrors the .NET structure
public record TradeRequest(
    String symbol,
    BigDecimal quantity,
    BigDecimal price,
    String side
) {}
```

Key optimizations:

  • Keep DTOs flat (no nested objects when avoidable)
  • Use primitive types and strings over complex objects
  • Avoid passing Java-specific types (HashMap internals, Stream objects) across the bridge
  • For large datasets: pass byte arrays and deserialize on the receiving side
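The byte-array approach from the last bullet can be as simple as `DataOutputStream` framing on the sending side and a matching `DataInputStream` reader on the receiving side. An illustrative Java sketch (the pack/read methods are hypothetical helpers, not a bridge API):

```java
import java.io.*;

public class ByteArrayTransport {
    // Pack a batch of (symbol, price) pairs into one byte[] so a single
    // bridge call carries the whole dataset
    static byte[] pack(String[] symbols, double[] prices) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(symbols.length);
        for (int i = 0; i < symbols.length; i++) {
            out.writeUTF(symbols[i]);
            out.writeDouble(prices[i]);
        }
        return bytes.toByteArray();
    }

    // Receiving side: deserialize inside its own runtime
    static double sumPrices(byte[] payload) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        int n = in.readInt();
        double total = 0;
        for (int i = 0; i < n; i++) {
            in.readUTF();             // skip symbol
            total += in.readDouble();
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] p = pack(new String[]{"MSFT", "AAPL"}, new double[]{1.0, 2.5});
        System.out.println(sumPrices(p)); // prints 3.5
    }
}
```

Any compact binary format (Protocol Buffers, MessagePack, or hand-rolled framing like this) works; the point is one marshaled array instead of thousands of marshaled objects.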

Connection and Resource Pooling

JVM Instance Reuse

Never create multiple JVM instances per request:

```csharp
// Singleton pattern for bridge initialization
public sealed class JavaBridge
{
    private static readonly Lazy<JavaBridge> _instance =
        new(() => new JavaBridge());

    public static JavaBridge Instance => _instance.Value;

    private JavaBridge()
    {
        JNBridge.Initialize(); // One-time (1-3 seconds)
    }
}
```

Object Pooling for Frequently Used Java Objects

```csharp
private readonly ObjectPool<JavaPdfParser> _parserPool =
    new DefaultObjectPool<JavaPdfParser>(
        new JavaPdfParserPoolPolicy(), maximumRetained: 10);

public byte[] ConvertPdf(byte[] input)
{
    var parser = _parserPool.Get();
    try { return parser.Convert(input); }
    finally { _parserPool.Return(parser); }
}
```

Profiling Tools

| Tool | Runtime | Best For | Free? |
|---|---|---|---|
| BenchmarkDotNet | .NET | Microbenchmarks, memory allocation | Yes |
| dotnet-trace / dotnet-counters | .NET | Runtime diagnostics, GC events | Yes |
| JDK Flight Recorder (JFR) | Java | Low-overhead production profiling | Yes |
| async-profiler | Java | CPU + allocation profiling, flame graphs | Yes |
| VisualVM | Java | Heap analysis, thread monitoring | Yes |
| OpenTelemetry | Both | Distributed tracing across runtimes | Yes |
| Prometheus + Grafana | Both | Metrics dashboards, alerting | Yes |

Recommended Workflow

  1. Start with OpenTelemetry tracing — instrument bridge calls with spans
  2. Enable GC logging on both runtimes — check for correlated pauses
  3. Run BenchmarkDotNet microbenchmarks — isolate bridge overhead
  4. Use JFR in production — low overhead (<2%) continuous profiling
  5. Build a Grafana dashboard — track P50/P95/P99 latency over time

Benchmarks: Before and After

| Scenario | Before | After | Improvement | Technique |
|---|---|---|---|---|
| 1000 individual calls | 10ms | 0.05ms | 200x | Batch call pattern |
| Complex object marshaling | 500µs | 50µs | 10x | DTO flattening |
| P99 latency (GC spikes) | 200ms | 2ms | 100x | ZGC + Server GC |
| Cold start (first call) | 5s | 1.5s | 3.3x | Eager class loading + tiered compilation |
| Concurrent throughput | 5K calls/s | 50K calls/s | 10x | Thread pool tuning + object pooling |
| TCP mode overhead | 0.5ms/call | 5µs/call | 100x | Switch to shared memory mode |

Most impactful: Switching from chatty calls to batch calls. Almost always the biggest win.

FAQ

What's the typical overhead of a JNBridgePro bridge call?
A single call with simple parameters takes 1–50µs in shared memory mode. For comparison, a REST API call to the same method on localhost: 5–50ms — 1000x slower.

Shared memory or TCP mode?
Use shared memory when Java and .NET run on the same machine. It eliminates network latency entirely (5µs vs 0.5ms per call). TCP mode is only for when JVM and CLR are on different machines.

How do I prevent JVM GC from blocking .NET?
Use ZGC (Java 17+) or Shenandoah for sub-millisecond pauses. On .NET, enable Server GC with concurrent mode. Monitor both GC logs.

Can I make bridge calls async?
Bridge calls are synchronous by design (direct method invocation). Wrap in Task.Run() for fire-and-forget, or use a producer-consumer queue where a background thread makes bridge calls.
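The producer-consumer variant can be sketched in a few lines of Java: callers enqueue work and a single background thread owns the synchronous bridge calls (the `Runnable` payloads here stand in for real proxy invocations):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: callers enqueue work; one background thread drains the queue and
// makes the (synchronous) bridge calls, so request threads never block
public class BridgeCallQueue implements AutoCloseable {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    public BridgeCallQueue() {
        worker = new Thread(() -> {
            try {
                while (true) queue.take().run(); // synchronous bridge call here
            } catch (InterruptedException e) {
                // shutdown requested; exit the loop
            }
        }, "bridge-worker");
        worker.setDaemon(true);
        worker.start();
    }

    // Fire-and-forget from the caller's perspective
    public void submit(Runnable bridgeCall) {
        queue.add(bridgeCall);
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

A single worker also serializes access to the bridge, which can help when the Java side isn't thread-safe; scale to a small worker pool if throughput demands it.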

How many concurrent calls can it handle?
No hard limit. With proper thread pool tuning, production systems handle 50,000+ calls/sec. The bottleneck is almost always business logic, not bridge overhead.

