If you're calling Java from .NET (or vice versa), you've probably noticed that cross-runtime calls aren't free. The bridge overhead itself is tiny — microseconds per call — but when you're making thousands of calls per request, those microseconds stack up fast.
I've spent a lot of time profiling cross-runtime performance in production systems, and the surprising truth is: the bridge is almost never the bottleneck. GC pauses, chatty call patterns, and object marshaling eat way more time. Here's everything I've learned about making Java/.NET integration fast.
Where Latency Actually Hides
Most teams blame the bridge. In practice, here's the real breakdown:
| Source | Typical Latency | How to Detect |
|---|---|---|
| Bridge call overhead | 1–50µs | Microbenchmark isolated calls |
| Object marshaling/serialization | 10–500µs | Profile with complex objects vs primitives |
| GC pauses (either runtime) | 1–200ms | GC logs (both JVM and CLR) |
| JVM cold start (first call) | 1–5s | Measure first call vs subsequent |
| Class loading (Java) | 10–100ms | Profile with -verbose:class |
| JIT compilation (both runtimes) | 50–500ms first execution | Warmup timing, tiered compilation logs |
| Thread contention at bridge | Variable | Thread dump analysis, lock profiling |
| Network latency (TCP mode) | 0.1–1ms per call | Switch to shared memory, compare |
Rule of thumb: If your cross-runtime calls are slower than expected, look at GC, class loading, and call patterns first — not the bridge mechanism.
Measure Before You Optimize
Performance tuning without measurement is guessing. Establish baselines first:
- Single call latency — One method call with a primitive parameter. This is your overhead floor.
- Complex call latency — Same call with realistic objects (lists, custom classes). Difference = marshaling cost.
- Throughput — Max calls/sec before latency degrades. Tests concurrency limits.
- P99 latency — The 99th percentile matters more than average. GC pauses cause tail spikes.
- Cold start time — First call after JVM init. Worst-case latency.
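Of these, P99 is the one teams most often skip measuring. A minimal nearest-rank percentile over recorded call latencies, for reference (the sample data is illustrative):

```java
import java.util.Arrays;

public class LatencyStats {
    // Nearest-rank percentile: p in (0, 100], samples in any order
    public static long percentile(long[] samplesMicros, double p) {
        long[] sorted = samplesMicros.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] latencies = new long[1000];
        for (int i = 0; i < latencies.length; i++) latencies[i] = i + 1; // 1..1000 µs
        System.out.println("P99 = " + percentile(latencies, 99.0) + "µs");
    }
}
```

Record latencies per bridge call in a ring buffer and report P50/P95/P99 periodically; averages hide exactly the GC-induced spikes this article is about.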
Benchmarking Template (C#)
// BenchmarkDotNet setup for cross-runtime calls
[MemoryDiagnoser]
[GcServer(true)]
public class BridgeCallBenchmarks
{
    private JavaProxy _proxy;
    private List<string> _testData;
    private Trade _sampleTrade;

    [GlobalSetup]
    public void Setup()
    {
        _proxy = new JavaProxy();
        _testData = new List<string>();
        for (int i = 0; i < 100; i++)
            _testData.Add($"item-{i}");
        _sampleTrade = Trade.CreateSample(); // hypothetical factory: build a representative trade

        // Warmup: 1000 calls to trigger JIT on both sides
        for (int i = 0; i < 1000; i++)
            _proxy.SimpleCall(i);
    }

    [Benchmark(Baseline = true)]
    public int SimpleCall() => _proxy.Add(42, 58);

    [Benchmark]
    public List<string> ComplexCall() => _proxy.ProcessList(_testData);

    [Benchmark]
    public TradeResult RealWorldCall() => _proxy.ExecuteTrade(_sampleTrade);
}
JVM Tuning for Bridge Workloads
Heap Sizing
When the JVM runs inside (or alongside) a .NET process, memory is shared. Set explicit bounds:
# Recommended JVM flags for bridge workloads
-Xms512m # Initial heap (avoid resize delays)
-Xmx1g # Maximum heap (leave room for CLR)
-XX:MaxMetaspaceSize=256m # Cap class metadata
-XX:ReservedCodeCacheSize=128m # JIT compiled code cache
Critical rule: Total JVM heap + CLR managed heap + native overhead must fit in available RAM. In a 4GB container: budget ~1GB for JVM, ~1.5GB for CLR, ~1.5GB for OS and native allocations.
GC Selection
| GC Algorithm | Best For | Bridge Impact |
|---|---|---|
| G1GC (Java 9+ default) | General workloads, 1–16GB heap | Good default. 10–50ms pause target. |
| ZGC | Ultra-low latency, large heaps | Sub-millisecond pauses. Best for latency-sensitive bridges. |
| Shenandoah | Low latency, Red Hat/OpenJDK | Similar to ZGC. Available in OpenJDK builds. |
| Serial GC | Small heaps (<256MB) | Stop-the-world but fast for tiny heaps. |
# For low-latency bridge workloads (Java 17+)
-XX:+UseZGC
-XX:SoftMaxHeapSize=768m # ZGC tries to keep the heap below this, uncommitting the rest
-XX:ZCollectionInterval=5 # Proactive GC every 5 seconds
# For general bridge workloads
-XX:+UseG1GC
-XX:MaxGCPauseMillis=20 # Target 20ms max pause
-XX:G1HeapRegionSize=4m # Optimize for your object sizes
JIT Compiler Optimization
# Enable tiered compilation (default in Java 9+)
-XX:+TieredCompilation
# Lower the compile threshold for hot bridge methods
# (note: CompileThreshold is honored only when tiered compilation is disabled)
-XX:CompileThreshold=100 # Compile after 100 invocations (default: 10000)
# For faster warmup at cost of peak performance:
-XX:TieredStopAtLevel=1 # Skip C2 compiler (faster startup)
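Whichever thresholds you pick, an explicit warmup loop at startup pays the JIT cost before real traffic arrives. A sketch, where the local add method stands in for a real bridge-exposed call:

```java
public class WarmupHarness {
    // Enough invocations to cross typical compile thresholds on both runtimes
    private static final int WARMUP_CALLS = 1_000;

    // Stand-in for a bridge-exposed method (hypothetical)
    static int add(int a, int b) { return a + b; }

    // Returns a checksum so the JIT cannot eliminate the loop as dead code
    public static long warmup() {
        long checksum = 0;
        for (int i = 0; i < WARMUP_CALLS; i++) {
            checksum += add(i, i + 1);
        }
        return checksum;
    }

    public static void main(String[] args) {
        System.out.println("warmup checksum = " + warmup());
    }
}
```

Run the loop once per bridge entry point during initialization, before the service starts accepting requests.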
CLR and .NET Runtime Tuning
Server GC vs Workstation GC
For bridge workloads, always use Server GC:
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": true,
      "System.GC.Concurrent": true,
      "System.GC.HeapHardLimit": 1610612736
    }
  }
}
Why: Workstation GC runs on a single thread and blocks longer. Server GC uses one thread per core, with shorter pauses. For concurrent bridge calls, Server GC reduces tail latency significantly.
.NET 9 DATAS GC
.NET's Dynamic Adaptation to Application Sizes (DATAS), introduced in .NET 8 and enabled by default in .NET 9, auto-adjusts heap size based on workload, meaning the CLR won't over-allocate when the JVM also needs heap space:
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.DynamicAdaptationMode": 1
    }
  }
}
Thread Pool Tuning
// Set minimum threads to avoid pool starvation during bridge calls
ThreadPool.SetMinThreads(
workerThreads: Environment.ProcessorCount * 2,
completionPortThreads: Environment.ProcessorCount);
Garbage Collection Coordination
The biggest performance killer: GC pauses in one runtime stalling the other.
When the JVM is in a stop-the-world GC pause, .NET threads waiting for bridge responses are blocked. If the CLR triggers its own GC at the same time, the pauses compound.
Mitigation Strategies
- Use low-pause GCs on both sides — ZGC (Java) + Server GC (.NET) keeps pauses under 1ms
- Stagger GC timing — Trigger proactive JVM GC during idle periods (-XX:ZCollectionInterval=5)
- Monitor both GC logs simultaneously — Correlate JVM GC events with .NET GC events to find compounding pauses
- Reduce object allocation at the bridge boundary — Reuse objects, use value types, avoid unnecessary boxing
Enabling GC Logs for Both Runtimes
# JVM GC logging
-Xlog:gc*:file=jvm-gc.log:time,uptime,level,tags:filecount=5,filesize=10m
# .NET GC events (captured via EventPipe rather than a log file)
dotnet-trace collect --process-id <pid> --profile gc-collect
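To correlate pauses across runtimes you need the JVM pause durations in machine-readable form. A small parser for unified-logging pause lines; the exact format varies by JVM version and decorators, so treat the regex as an assumption to adapt:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogPause {
    // Matches pause lines such as:
    //   [1.234s][info][gc] GC(5) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 12.345ms
    private static final Pattern PAUSE =
        Pattern.compile("GC\\((\\d+)\\) Pause .*? (\\d+\\.\\d+)ms");

    // Returns the pause duration in milliseconds, or -1 if the line is not a pause line
    public static double pauseMillis(String logLine) {
        Matcher m = PAUSE.matcher(logLine);
        return m.find() ? Double.parseDouble(m.group(2)) : -1;
    }
}
```

Feed the extracted pauses and the .NET GC event timestamps into one timeline; pauses that overlap are the compounding stalls described above.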
Optimizing Cross-Runtime Call Patterns
Anti-Pattern: Chatty Calls
// BAD: 1000 individual bridge calls
for (int i = 0; i < orders.Count; i++)
{
    var result = javaService.ValidateOrder(orders[i]); // ~10µs each
    // 1000 * 10µs = 10ms overhead
}
Pattern: Batch Calls
// GOOD: 1 bridge call with batch data
var results = javaService.ValidateOrders(orders); // ~50µs total
// 50µs vs 10ms = 200x faster
Rule: Every cross-runtime call has fixed overhead. Minimize the number of calls, not the data per call. One call with 1000 items beats 1000 calls with 1 item.
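The batching rule can be made mechanical with a small accumulator on the calling side: buffer items and flush them in one call once the batch fills. A sketch, where the flushTarget callback stands in for the bridge call (for example, ValidateOrders):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers items and forwards them in fixed-size batches,
// turning N bridge calls into N / batchSize calls.
public class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushTarget; // stand-in for the bridge call
    private final List<T> buffer = new ArrayList<>();
    private int flushes = 0;

    public Batcher(int batchSize, Consumer<List<T>> flushTarget) {
        this.batchSize = batchSize;
        this.flushTarget = flushTarget;
    }

    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    public void flush() {
        if (buffer.isEmpty()) return;
        flushTarget.accept(new ArrayList<>(buffer)); // one bridge call per batch
        buffer.clear();
        flushes++;
    }

    public int flushCount() { return flushes; }
}
```

One flush of 100 items costs roughly one call's overhead instead of 100; remember to flush any partial batch at the end of the request.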
Pattern: Coarse-Grained Interfaces
// BAD: Fine-grained Java API from .NET
var customer = javaProxy.GetCustomer(id);
var address = javaProxy.GetAddress(customer.AddressId);
var orders = javaProxy.GetOrders(customer.Id);
var total = javaProxy.CalculateTotal(orders);
// 4 bridge calls
// GOOD: Coarse-grained facade
var summary = javaProxy.GetCustomerSummary(id);
// 1 bridge call — Java handles the joins internally
Design principle: Create coarse-grained Java facades that batch operations per bridge call. Let Java-to-Java calls happen inside the JVM (zero overhead), and only cross the bridge for the final result.
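On the Java side, such a facade is an ordinary class that does the joins in-process and returns one flat object. All names here are illustrative:

```java
import java.util.List;

// Illustrative facade: the lookups stay inside the JVM,
// and one flat summary object crosses the bridge.
public class CustomerFacade {
    private final CustomerRepo customers;
    private final OrderRepo orders;

    public CustomerFacade(CustomerRepo customers, OrderRepo orders) {
        this.customers = customers;
        this.orders = orders;
    }

    // The only method exposed across the bridge
    public CustomerSummary getCustomerSummary(long id) {
        Customer c = customers.find(id);           // in-JVM call
        List<Order> o = orders.findByCustomer(id); // in-JVM call
        double total = o.stream().mapToDouble(Order::amount).sum();
        return new CustomerSummary(c.name(), o.size(), total);
    }

    // Minimal supporting types for the sketch
    public record Customer(long id, String name) {}
    public record Order(long id, double amount) {}
    public record CustomerSummary(String name, int orderCount, double total) {}
    public interface CustomerRepo { Customer find(long id); }
    public interface OrderRepo { List<Order> findByCustomer(long id); }
}
```

Only getCustomerSummary pays bridge overhead; the repository lookups never leave the JVM.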
Pattern: Async Fire-and-Forget
// For non-blocking operations (logging, analytics, cache warming)
Task.Run(() => javaProxy.LogAnalyticsEvent(eventData));
// Don't await — .NET continues immediately
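For sustained volume, a producer-consumer queue with a single background consumer is more predictable than spawning a task per event. A Java sketch of the shape; the bridgeCall callback stands in for the actual cross-runtime call:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// One background thread drains the queue and makes the (stand-in) bridge
// call, so producers never block on cross-runtime latency.
public class FireAndForgetQueue<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final Thread consumer;
    private volatile boolean running = true;

    public FireAndForgetQueue(Consumer<T> bridgeCall) {
        consumer = new Thread(() -> {
            while (running || !queue.isEmpty()) {
                try {
                    T item = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (item != null) bridgeCall.accept(item);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        consumer.setDaemon(true);
        consumer.start();
    }

    public void submit(T item) { queue.add(item); } // returns immediately

    // Drains remaining items, then stops the consumer thread
    public void shutdown() {
        running = false;
        try { consumer.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The same shape works in .NET with a Channel or BlockingCollection feeding one bridge-calling thread.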
Object Marshaling Optimization
| Data Type | Marshaling Cost | Optimization |
|---|---|---|
| Primitives (int, double, bool) | Negligible | Use directly |
| Strings | Low (UTF-16 both sides) | Avoid unnecessary conversions |
| Arrays of primitives | Low (bulk copy) | Prefer over List<T> |
| Simple objects (few fields) | Low-Medium | Use DTOs, not full entities |
| Collections (List, Map) | Medium (element-by-element) | Use arrays when possible |
| Deep object graphs | High | Flatten or use DTOs |
| Exceptions | High (stack trace construction) | Use error codes for expected failures |
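The last row deserves emphasis: a marshaled exception pays for stack-trace construction on one side and reconstruction on the other. For expected failures, return a flat status object instead. A sketch (the names are illustrative, not a JNBridgePro type):

```java
// A flat result type: a status code instead of a marshaled exception.
// Expected failures (validation, not-found) return a code; only truly
// exceptional conditions should throw across the bridge.
public record BridgeResult<T>(int code, String message, T value) {
    public static final int OK = 0;
    public static final int VALIDATION_FAILED = 1;

    public static <T> BridgeResult<T> ok(T value) {
        return new BridgeResult<>(OK, null, value);
    }

    public static <T> BridgeResult<T> error(int code, String message) {
        return new BridgeResult<>(code, message, null);
    }

    public boolean isOk() { return code == OK; }
}
```

Because the record is flat (an int and two references), it marshals like any other simple DTO.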
DTO Pattern for Cross-Runtime Data
// .NET DTO — flat, minimal fields
public record TradeRequest(
    string Symbol,
    decimal Quantity,
    decimal Price,
    string Side // "BUY" or "SELL"
);
// Java DTO — mirrors the .NET structure
public record TradeRequest(
    String symbol,
    BigDecimal quantity,
    BigDecimal price,
    String side
) {}
Key optimizations:
- Keep DTOs flat (no nested objects when avoidable)
- Use primitive types and strings over complex objects
- Avoid passing Java-specific types (HashMap internals, Stream objects) across the bridge
- For large datasets: pass byte arrays and deserialize on the receiving side
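The last point can be sketched with plain DataOutputStream framing: pack the dataset into one byte[], cross the bridge once, and decode on the other side. The wire format here is an ad-hoc assumption, not a feature of any bridge product:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Packs (symbol, price) pairs into one byte[] so a single primitive array
// crosses the bridge instead of an element-by-element collection.
public class TradeCodec {
    public static byte[] encode(List<String> symbols, double[] prices) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(symbols.size());
            for (int i = 0; i < symbols.size(); i++) {
                out.writeUTF(symbols.get(i));
                out.writeDouble(prices[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static List<String> decodeSymbols(byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            int n = in.readInt();
            List<String> symbols = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                symbols.add(in.readUTF());
                in.readDouble(); // skip the price field
            }
            return symbols;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On the .NET side, BinaryReader with big-endian conversion (or a shared format like Protobuf) decodes the same payload.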
Connection and Resource Pooling
JVM Instance Reuse
Never create multiple JVM instances per request:
// Singleton pattern for bridge initialization
public sealed class JavaBridge
{
    private static readonly Lazy<JavaBridge> _instance =
        new(() => new JavaBridge());

    public static JavaBridge Instance => _instance.Value;

    private JavaBridge()
    {
        JNBridge.Initialize(); // One-time cost (1–3 seconds)
    }
}
Object Pooling for Frequently Used Java Objects
private readonly ObjectPool<JavaPdfParser> _parserPool =
    new DefaultObjectPool<JavaPdfParser>(
        new JavaPdfParserPoolPolicy(), maximumRetained: 10);

public byte[] ConvertPdf(byte[] input)
{
    var parser = _parserPool.Get();
    try { return parser.Convert(input); }
    finally { _parserPool.Return(parser); }
}
Profiling Tools
| Tool | Runtime | Best For | Free? |
|---|---|---|---|
| BenchmarkDotNet | .NET | Microbenchmarks, memory allocation | Yes |
| dotnet-trace / dotnet-counters | .NET | Runtime diagnostics, GC events | Yes |
| JDK Flight Recorder (JFR) | Java | Low-overhead production profiling | Yes |
| async-profiler | Java | CPU + allocation profiling, flame graphs | Yes |
| VisualVM | Java | Heap analysis, thread monitoring | Yes |
| OpenTelemetry | Both | Distributed tracing across runtimes | Yes |
| Prometheus + Grafana | Both | Metrics dashboards, alerting | Yes |
Recommended Workflow
- Start with OpenTelemetry tracing — instrument bridge calls with spans
- Enable GC logging on both runtimes — check for correlated pauses
- Run BenchmarkDotNet microbenchmarks — isolate bridge overhead
- Use JFR in production — low overhead (<2%) continuous profiling
- Build a Grafana dashboard — track P50/P95/P99 latency over time
Benchmarks: Before and After
| Scenario | Before | After | Improvement | Technique |
|---|---|---|---|---|
| 1000 individual calls | 10ms | 0.05ms | 200x | Batch call pattern |
| Complex object marshaling | 500µs | 50µs | 10x | DTO flattening |
| P99 latency (GC spikes) | 200ms | 2ms | 100x | ZGC + Server GC |
| Cold start (first call) | 5s | 1.5s | 3.3x | Eager class loading + tiered compilation |
| Concurrent throughput | 5K calls/s | 50K calls/s | 10x | Thread pool tuning + object pooling |
| TCP mode overhead | 0.5ms/call | 5µs/call | 100x | Switch to shared memory mode |
Most impactful: Switching from chatty calls to batch calls. Almost always the biggest win.
FAQ
What's the typical overhead of a JNBridgePro bridge call?
A single call with simple parameters takes 1–50µs in shared memory mode. For comparison, a REST API call to the same method on localhost takes 5–50ms, two to three orders of magnitude slower.
Shared memory or TCP mode?
Use shared memory when Java and .NET run on the same machine; it eliminates network latency entirely (5µs vs 0.5ms per call). TCP mode is needed only when the JVM and CLR run on different machines.
How do I prevent JVM GC from blocking .NET?
Use ZGC (Java 17+) or Shenandoah for sub-millisecond pauses. On .NET, enable Server GC with concurrent mode. Monitor both GC logs.
Can I make bridge calls async?
Bridge calls are synchronous by design (direct method invocation). Wrap in Task.Run() for fire-and-forget, or use a producer-consumer queue where a background thread makes bridge calls.
How many concurrent calls can it handle?
No hard limit. With proper thread pool tuning, production systems handle 50,000+ calls/sec. The bottleneck is almost always business logic, not bridge overhead.
This article was originally published at jnbridge.com.