When our applications are running in production, they face realities we can't fully replicate in testing. Real traffic, real data, and real infrastructure create unique performance puzzles. I've spent years piecing these puzzles together, and I want to share a practical set of methods that work. This isn't about lab benchmarks; it's about understanding what happens when your code meets the world.
Let's talk about looking at performance without slowing everything down. Imagine trying to understand a busy intersection by watching it for one second every minute. You get a useful picture without stopping traffic. That's the idea behind low-overhead sampling.
A tool I often use takes snapshots of what every thread is doing at regular intervals, say every 10 milliseconds. It doesn't record every single method call—that would be overwhelming and slow. Instead, it builds a statistical model. If a particular method appears in 30% of your snapshots, you can be confident it's using about 30% of your CPU time. The beauty is the cost; you might only add 1-2% overhead, which is fine for most production systems.
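To make the idea concrete, here is a toy sketch of the principle (my own illustration, not how async-profiler or any production tool is implemented): snapshot every thread's stack at a fixed interval and count which method sits on top. Over many snapshots, the counts approximate where CPU time goes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy statistical sampler: periodically capture all thread stacks and
// count the top frame of each. Hot methods accumulate high counts.
public class ToySampler {

    public static Map<String, Integer> sample(int snapshots, long intervalMs) {
        Map<String, Integer> topFrameCounts = new HashMap<>();
        for (int i = 0; i < snapshots; i++) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length > 0) {
                    // Credit the frame currently executing at the top of the stack
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    topFrameCounts.merge(frame, 1, Integer::sum);
                }
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return topFrameCounts;
    }
}
```

A real profiler does this far more cheaply (with native signals and async-safe stack walking), but the statistics work the same way: a method appearing in 30% of snapshots is consuming roughly 30% of the time.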
Here’s how I might integrate this into an application to allow on-demand profiling. An API endpoint can trigger a profile capture for a set duration when we suspect an issue.
// A simple REST controller to manage profiling sessions
import com.sun.tools.attach.VirtualMachine; // from the jdk.attach module
import java.util.concurrent.*;

@RestController
@RequestMapping("/internal/profile")
public class ProfileCaptureController {

    private static final Logger log = LoggerFactory.getLogger(ProfileCaptureController.class);
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile String currentSessionId = null;

    @PostMapping("/start")
    public synchronized ResponseEntity<String> startProfile(@RequestParam(defaultValue = "60") int durationSeconds) {
        if (currentSessionId != null) {
            return ResponseEntity.badRequest().body("A profile session is already running.");
        }
        currentSessionId = "profile-" + System.currentTimeMillis();
        String outputFile = "/app/profiles/" + currentSessionId + ".jfr";
        try {
            // Attach to our own JVM. Self-attach requires starting the JVM with
            // -Djdk.attach.allowAttachSelf=true on modern JDKs.
            String pid = String.valueOf(ProcessHandle.current().pid());
            VirtualMachine vm = VirtualMachine.attach(pid);
            // Load the async-profiler agent. The library path depends on your deployment;
            // the .jfr extension tells the profiler to write JFR-format output.
            vm.loadAgent("/app/async-profiler/lib/libasyncProfiler.so",
                    "start,event=cpu,file=" + outputFile);
            vm.detach();
            // Schedule the session to stop automatically
            scheduler.schedule(this::stopProfile, durationSeconds, TimeUnit.SECONDS);
            return ResponseEntity.ok("Profiling started. Session ID: " + currentSessionId);
        } catch (Exception e) {
            currentSessionId = null;
            return ResponseEntity.internalServerError().body("Failed to start: " + e.getMessage());
        }
    }

    @PostMapping("/stop")
    public synchronized ResponseEntity<String> stopProfile() {
        if (currentSessionId == null) {
            return ResponseEntity.badRequest().body("No active profile session.");
        }
        try {
            String pid = String.valueOf(ProcessHandle.current().pid());
            VirtualMachine vm = VirtualMachine.attach(pid);
            vm.loadAgent("/app/async-profiler/lib/libasyncProfiler.so", "stop");
            vm.detach();
            String session = currentSessionId;
            currentSessionId = null;
            // Here you could trigger analysis or alerting
            log.info("Profile session {} completed. File is ready for analysis.", session);
            return ResponseEntity.ok("Profiling stopped for session: " + session);
        } catch (Exception e) {
            return ResponseEntity.internalServerError().body("Failed to stop: " + e.getMessage());
        }
    }
}
This approach gives us a controlled way to capture data during a performance incident without needing a full restart or complex configuration changes.
The Java runtime itself has a powerful built-in feature for continuous observation. Think of it as a flight data recorder for your application. It's always running in the background with minimal impact, keeping a rotating buffer of the last hour of activity. When something goes wrong, you have the data from the moments leading up to the issue.
You can configure what to record. You might track CPU usage, garbage collection pauses, threads waiting on locks, and even your own custom events. The overhead is so manageable—often less than 1%—that I leave it on all the time in production.
Let me show you how I set this up programmatically. I like to have more control than just startup flags.
// A service to manage the continuous JFR recording
@Service
public class FlightRecorderService {

    private static final Logger log = LoggerFactory.getLogger(FlightRecorderService.class);
    private Recording continuousRecording;

    @EventListener(ApplicationReadyEvent.class)
    public void startContinuousRecording() {
        continuousRecording = new Recording();
        continuousRecording.setName("Production-Recording");
        // Spill to the disk repository so the rotating buffer survives memory pressure
        continuousRecording.setToDisk(true);
        // Keep the last 30 minutes of data
        continuousRecording.setMaxAge(Duration.ofMinutes(30));
        // Or keep it under 150 MB on disk
        continuousRecording.setMaxSize(150L * 1024 * 1024);
        // Enable key events with sensible intervals
        continuousRecording.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(5));
        continuousRecording.enable("jdk.GarbageCollection");
        // Only record monitor waits that take more than 20ms
        continuousRecording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(20));
        continuousRecording.start();
        log.info("Continuous JFR recording started.");
    }

    public Path captureSnapshot(String reason) {
        if (continuousRecording == null) {
            throw new IllegalStateException("Recording not active");
        }
        Path snapshotFile = Paths.get("/app/profiles/snapshot-" +
                Instant.now().toString().replace(":", "_") +
                "-" + reason + ".jfr");
        try {
            continuousRecording.dump(snapshotFile); // dump() throws IOException
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to dump JFR snapshot", e);
        }
        log.info("JFR snapshot captured to {} due to: {}", snapshotFile, reason);
        return snapshotFile;
    }

    @PreDestroy
    public void stop() {
        if (continuousRecording != null) {
            continuousRecording.close(); // close() also stops a running recording
        }
    }
}
The real power comes when you add your own events. This connects generic JVM metrics to your specific business logic.
// Defining a custom event for a business operation. The time between begin()
// and end() is recorded automatically as the event's duration, so no explicit
// timing field is needed.
@Name("com.myapp.OrderProcessing")
@Label("Order Processing")
@Description("Tracks the duration and outcome of processing an order")
class OrderProcessEvent extends jdk.jfr.Event {
    @Label("Order ID")
    String orderId;
    @Label("Customer Tier")
    String customerTier;
    @Label("Success")
    boolean success;
}

// Using the event in your service
@Service
public class OrderService {
    public OrderResult processOrder(Order order) {
        OrderProcessEvent event = new OrderProcessEvent();
        event.orderId = order.getId();
        event.customerTier = order.getCustomer().getTier();
        event.begin(); // Start the timer
        try {
            // ... your complex order logic ...
            OrderResult result = executeOrderLogic(order);
            event.success = true;
            return result;
        } catch (Exception e) {
            event.success = false;
            throw e;
        } finally {
            event.end();    // Stop the timer
            event.commit(); // Write the event to the recording
        }
    }
}
Now, when you analyze the recording, you can see not just that the JVM was busy, but that it was busy processing orders for "premium" customers, and perhaps that tier has a specific performance profile.
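You can also do that analysis programmatically with the JDK's own `jdk.jfr.consumer` API. Here is a hypothetical helper (the class name, method, and threshold parameter are my own) that pulls the slow OrderProcessing events for one customer tier out of a dumped snapshot:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Offline analysis of a .jfr snapshot: find our custom events that were
// both slow and for a particular customer tier.
public class OrderEventReader {

    public static List<RecordedEvent> slowOrders(Path jfrFile, String tier, long thresholdMs)
            throws IOException {
        List<RecordedEvent> slow = new ArrayList<>();
        try (RecordingFile file = new RecordingFile(jfrFile)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                // Match on the @Name we gave the custom event
                if (!"com.myapp.OrderProcessing".equals(event.getEventType().getName())) {
                    continue;
                }
                if (tier.equals(event.getString("customerTier"))
                        && event.getDuration().toMillis() >= thresholdMs) {
                    slow.add(event);
                }
            }
        }
        return slow;
    }
}
```

This is how a "premium orders got slow at 14:02" hypothesis becomes a concrete list of events with stack traces attached.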
One of the most common surprises in production is memory behavior. An application might work perfectly with test datasets, but create and discard objects at a monstrous rate under real load. This churn puts constant pressure on the garbage collector, leading to pauses that stall your application. Finding what is being allocated, and where, is crucial.
The continuous recording we already set up can be configured to sample allocations. It won't catch every single object—that would be too heavy—but it will show you the patterns and the hot paths.
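As a sketch, assuming JDK 16 or newer where the rate-limited `jdk.ObjectAllocationSample` event exists (older JDKs would enable the heavier `jdk.ObjectAllocationInNewTLAB`/`jdk.ObjectAllocationOutsideTLAB` events instead), allocation sampling can be switched on like this:

```java
import jdk.jfr.Recording;

// Configure a recording with throttled allocation sampling. The throttle
// setting caps the event rate, so overhead stays bounded no matter how
// fast the application allocates.
public class AllocationSamplingConfig {

    public static Recording startWithAllocationSampling() {
        Recording recording = new Recording();
        recording.setName("Allocation-Sampling");
        // At most ~150 allocation samples per second, JVM-wide
        recording.enable("jdk.ObjectAllocationSample").with("throttle", "150/s");
        recording.start();
        return recording;
    }
}
```

In practice you would add this `enable` call to the continuous recording service above rather than start a separate recording.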
Sometimes, you need to be more targeted. If I suspect a particular service method is allocating too much, I might add a direct measurement.
@Repository
public class ProductRepository {

    private static final Logger log = LoggerFactory.getLogger(ProductRepository.class);

    @PersistenceContext
    private EntityManager entityManager;

    public List<Product> findProducts(Filter filter) {
        // Rough baseline of heap used before the operation. A GC between the two
        // reads can skew this badly, so treat it as a hint, not a measurement.
        long memoryBefore = getCurrentHeapUsed();
        List<Product> products = entityManager.createQuery("...", Product.class)
                .getResultList();
        long memoryAfter = getCurrentHeapUsed();
        long allocatedApprox = memoryAfter - memoryBefore;
        if (allocatedApprox > 5 * 1024 * 1024) { // More than 5MB
            log.warn("Large allocation in findProducts. Filter: {}, Allocated ~{} bytes, Returned {} products.",
                    filter, allocatedApprox, products.size());
        }
        return products;
    }

    private long getCurrentHeapUsed() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }
}
This is a coarse measurement, but it's effective for spotting outliers. Once you know a method is a heavy allocator, you can investigate further. Often, the fix is straightforward: reusing a StringBuilder, initializing an ArrayList with the correct capacity, or caching a frequently created object.
// A common allocation hotspot and a simple fix
public class MessageBuilder {

    // Inefficient version - allocates a new StringBuilder on every call
    public String buildMessageOld(String user, String action) {
        StringBuilder sb = new StringBuilder(); // Allocation here
        sb.append("User '").append(user).append("' performed action: ").append(action);
        return sb.toString();
    }

    // More efficient version - reuses a pre-sized, thread-local StringBuilder.
    // Caveat: in a thread pool the builder lives as long as the thread does,
    // so keep the retained capacity modest.
    private static final ThreadLocal<StringBuilder> REUSABLE_BUILDER =
            ThreadLocal.withInitial(() -> new StringBuilder(1024));

    public String buildMessageNew(String user, String action) {
        StringBuilder sb = REUSABLE_BUILDER.get();
        sb.setLength(0); // Clear it for reuse
        sb.append("User '").append(user).append("' performed action: ").append(action);
        return sb.toString();
    }
}
Allocation profiling often leads to these simple, high-impact optimizations that smooth out garbage collection and improve overall responsiveness.
Applications slow down not just because of CPU or memory, but because threads are waiting. They're waiting for a database, waiting for a lock, waiting for another service. Under light load, these waits are short. Under production load, they can cascade into major delays.
The first step is visibility. You need to know how many threads are running, waiting, or blocked. A simple periodic check can alert you to deteriorating conditions.
@Component
public class SimpleThreadWatchdog {

    private static final Logger log = LoggerFactory.getLogger(SimpleThreadWatchdog.class);
    private final FlightRecorderService jfrService;

    public SimpleThreadWatchdog(FlightRecorderService jfrService) {
        this.jfrService = jfrService;
    }

    @Scheduled(fixedDelay = 10000) // Run every 10 seconds
    public void checkThreadHealth() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        ThreadInfo[] allThreads = threadBean.dumpAllThreads(false, false);
        int runnableCount = 0;
        int blockedCount = 0;
        int waitingCount = 0;
        for (ThreadInfo info : allThreads) {
            switch (info.getThreadState()) {
                case RUNNABLE:
                    runnableCount++;
                    break;
                case BLOCKED:
                    blockedCount++;
                    log.debug("Blocked thread: {} waiting for lock {}",
                            info.getThreadName(), info.getLockInfo());
                    break;
                case WAITING:
                case TIMED_WAITING:
                    waitingCount++;
                    break;
                default:
                    break;
            }
        }
        // Alert if too many threads are not making progress
        if (blockedCount > 10) {
            log.error("High thread contention: {} threads are BLOCKED.", blockedCount);
            // Trigger a JFR snapshot for deeper analysis
            jfrService.captureSnapshot("high_thread_blocked");
        }
        // Log a periodic health summary
        log.info("Thread State - Runnable: {}, Blocked: {}, Waiting/Timed: {}",
                runnableCount, blockedCount, waitingCount);
    }
}
When the watchdog alerts you to lock contention, JFR's built-in events can show you the exact code paths and stack traces where threads are getting stuck. The key is to look for locks that are held for a long time, or where many threads are waiting for the same lock.
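One way to do that offline, sketched here with a hypothetical helper class of my own, is to read the `jdk.JavaMonitorEnter` events out of a snapshot and total the blocked time per monitor class:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Offline lock analysis: sum up blocked time per monitor class from the
// jdk.JavaMonitorEnter events in a dumped JFR snapshot.
public class LockContentionReport {

    public static Map<String, Long> blockedMillisByMonitorClass(Path jfrFile) throws IOException {
        Map<String, Long> blockedMs = new HashMap<>();
        try (RecordingFile file = new RecordingFile(jfrFile)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                if (!"jdk.JavaMonitorEnter".equals(event.getEventType().getName())) {
                    continue;
                }
                // The "monitorClass" field identifies which object's lock was contended
                String monitor = event.getClass("monitorClass") == null
                        ? "unknown" : event.getClass("monitorClass").getName();
                blockedMs.merge(monitor, event.getDuration().toMillis(), Long::sum);
            }
        }
        return blockedMs;
    }
}
```

A class dominating this report is your serialization point; the stack traces on the same events show exactly who was holding and who was waiting.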
I often find the problem is not a traditional synchronized block, but contention on concurrent structures or database connections.
// An example of a contended resource and a potential mitigation
@Service
public class InventoryCache {

    private final Map<String, Integer> inventoryMap = new ConcurrentHashMap<>();

    // A shared, single-threaded executor
    private final ExecutorService reportingExecutor = Executors.newSingleThreadExecutor();

    // submit() returns immediately, but every report is serialized through one
    // worker: under load its unbounded queue grows and reporting falls far
    // behind the inventory updates.
    public void updateAndReportOld(String itemId, int change) {
        inventoryMap.compute(itemId, (k, v) -> (v == null ? 0 : v) + change);
        reportingExecutor.submit(() -> sendToReportingService(itemId, change));
    }

    // Mitigation: spread the reporting work across several workers draining a shared queue
    private final BlockingQueue<ReportTask> reportQueue = new LinkedBlockingQueue<>();
    private final ExecutorService multiWorkerExecutor = Executors.newFixedThreadPool(4);

    @PostConstruct
    public void init() {
        // Start workers that drain the queue
        for (int i = 0; i < 4; i++) {
            multiWorkerExecutor.submit(this::reportingWorker);
        }
    }

    public void updateAndReportNew(String itemId, int change) {
        inventoryMap.compute(itemId, (k, v) -> (v == null ? 0 : v) + change);
        // offer() on an unbounded LinkedBlockingQueue always succeeds and returns immediately
        reportQueue.offer(new ReportTask(itemId, change));
    }

    private void reportingWorker() {
        while (true) {
            try {
                ReportTask task = reportQueue.take(); // Efficiently waits for work
                sendToReportingService(task.itemId(), task.change());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    @PreDestroy
    public void shutdown() {
        reportingExecutor.shutdown();
        multiWorkerExecutor.shutdownNow(); // interrupt workers blocked in take()
    }

    private record ReportTask(String itemId, int change) {}
}
Thread profiling shifts your focus from "why is the CPU high?" to "why aren't we making progress?" It reveals the hidden serialization points in your architecture.
All the technical data in the world is useless if you can't connect it to a business outcome. Knowing a method is slow is good. Knowing it's slowing down the checkout process for your highest-value customers is what drives action.
This is where correlation comes in. We propagate a unique identifier through every step of a user's request.
// A filter to set up correlation for every HTTP request
@WebFilter("/*")
public class CorrelationFilter implements Filter {

    private static final Logger log = LoggerFactory.getLogger(CorrelationFilter.class);

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) request;
        HttpServletResponse httpRes = (HttpServletResponse) response;
        // Get or generate a correlation ID
        String correlationId = httpReq.getHeader("X-Correlation-ID");
        if (correlationId == null || correlationId.isBlank()) {
            correlationId = "req_" + UUID.randomUUID().toString().substring(0, 8);
        }
        // Store it in the MDC so every log line in this request carries it
        MDC.put("correlationId", correlationId);
        // Add it to the response so the caller can track it
        httpRes.setHeader("X-Correlation-ID", correlationId);
        // Time the whole request
        long startTime = System.nanoTime();
        try {
            chain.doFilter(request, response);
        } finally {
            long durationMs = (System.nanoTime() - startTime) / 1_000_000;
            String path = httpReq.getRequestURI();
            // Log with correlation ID and performance data
            log.info("Request completed. correlationId={}, path={}, durationMs={}, status={}",
                    correlationId, path, durationMs, httpRes.getStatus());
            MDC.clear(); // Clean up the thread-local storage
        }
    }
}
Now, every log message from that request carries the correlationId. More powerfully, you can tag your performance metrics with it.
// Using Micrometer to record metrics with business context
@Service
public class PaymentService {

    private final MeterRegistry meterRegistry;

    public PaymentService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public Receipt processPayment(PaymentRequest request) {
        // Create tags based on business context
        Tags contextTags = Tags.of(
                "payment.method", request.getMethod(),
                "customer.country", request.getCustomerCountry(),
                "amount.bracket", getAmountBracket(request.getAmount()));
        // Micrometer fixes tags per Timer instance, so resolve the timer for this
        // tag combination; register() returns the existing meter if one matches.
        Timer timer = Timer.builder("app.payments.process")
                .description("Time to process a payment")
                .publishPercentiles(0.95, 0.99) // Track tail latency
                .tags(contextTags)
                .register(meterRegistry);
        // Time the operation under those tags
        return timer.record(() -> executePayment(request)); // Your actual logic
    }

    private String getAmountBracket(BigDecimal amount) {
        if (amount.compareTo(new BigDecimal("100")) < 0) return "small";
        if (amount.compareTo(new BigDecimal("1000")) < 0) return "medium";
        return "large";
    }
}
With this in place, you can ask specific questions: "What is the 99th percentile latency for 'large' payments from 'CountryX'?" When you capture a JFR snapshot during a slow period, you can search for events that have the same correlationId as a slow payment you see in your metrics dashboard. This connects the dots from a user complaint, to a business metric, to a specific line of code in a specific JVM.
This is the goal of production profiling. It's not about fancy tools for their own sake. It's about building a clear line of sight from what your users experience to what your code is doing. You start with broad, low-cost sampling to know where to look. You use continuous recording to have data when you need it. You investigate memory and threads to find the root cause of slowness. And you tie it all to business context so you fix what matters most.
The code examples I've shared are starting points. Adapt them. Start simple. Maybe just add the correlation filter and log a few key timers. The most important step is to begin observing your application in its natural habitat. You'll be surprised what you learn when you start listening to it under real load. The stories it tells will guide your most impactful optimizations.