Let's talk about performance. If you've ever worked on a Java application that felt slow, you've probably asked the classic questions: Why is it slow? Where is it slow? Is it the database? The memory? The code itself? For years, finding answers felt like detective work with poor clues—adding log statements, guessing, and restarting the application with different settings.
I used to do that. I would stare at a slow-running batch job or a web service timing out and start swapping theories with my team. We'd add more memory, tweak a thread pool size, or blame the network. Sometimes we got lucky. Often, we didn't. The real problem remained hidden.
This changed for me when I started using the tools that are now built right into the Java Development Kit. They moved performance work from being a guessing game to a precise exercise in measurement. Today, I want to walk you through how I use them. I'll show you five specific ways I profile applications, using Java Flight Recorder (JFR) and JDK Mission Control (JMC). We'll use plenty of code examples so you can try this yourself.
First, you need to know that these tools are not an extra download for modern JDKs. If you're using JDK 11 or later, especially JDK 17 or 21, they're already there. Flight Recorder is like a black box flight data recorder for your Java application. It runs constantly with very little cost, capturing a huge amount of detail about what the JVM and your code are doing. Mission Control is the analysis software you use to read those recordings.
The biggest shift in thinking is to start recording before there's a problem. You don't wait for the fire to start to install smoke alarms.
Let's begin with the first technique: simply turning it on and capturing data. This is the foundation.
In the old days, profilers would slow your application to a crawl. JFR is designed to have minimal overhead, usually under 1-2%, which is low enough to run in production. The easiest way to start is by adding a command-line flag when you launch your application.
# This is how you start a 60-second recording when the app starts.
java -XX:StartFlightRecording=filename=recording.jfr,duration=60s,settings=profile -jar myapp.jar
But what if you want more control from within your code? You can do that too. Imagine you have a specific process, like generating a complex report, and you want to profile just that part.
import jdk.jfr.*;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;

public class ReportGenerator {

    public void generateComplexReport() {
        // Define a new recording for just this task.
        try (Recording recording = new Recording()) {
            // We can choose which events to capture and how often.
            // This turns on CPU load events, sampled every second.
            recording.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            // This captures garbage collection events that take longer than 10ms.
            recording.enable("jdk.GarbageCollection").withThreshold(Duration.ofMillis(10));
            // This captures when threads pause (park), useful for spotting contention.
            recording.enable("jdk.ThreadPark").withThreshold(Duration.ofMillis(10));

            System.out.println("Starting performance recording for report generation...");
            recording.start();

            // This is our actual business logic.
            executeReportLogic();

            recording.stop();

            // Save the recording to a file for later analysis in JDK Mission Control.
            Path outputPath = Paths.get("report-profile.jfr");
            recording.dump(outputPath);
            System.out.println("Recording saved to: " + outputPath.toAbsolutePath());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void executeReportLogic() {
        // Simulate heavy work.
        for (int i = 0; i < 1_000_000; i++) {
            // ... some data processing ...
        }
    }
}
When you open the resulting .jfr file in JDK Mission Control, you're not looking at a single number. You're looking at a timeline, graphs of CPU and memory, lists of the slowest methods, and logs of every garbage collection. You can see exactly what was happening at the moment your application slowed down. The guesswork starts to fade away.
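You don't have to do all of your analysis in the JMC UI, either. The JDK also ships a consumer API, jdk.jfr.consumer, for reading recordings from code. Here is a minimal sketch, assuming the report-profile.jfr file from the example above, that prints the five longest garbage collection events in the recording:

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Comparator;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class RecordingReport {
    public static void main(String[] args) throws IOException {
        // Load every event from the recording file.
        var events = RecordingFile.readAllEvents(Paths.get("report-profile.jfr"));

        // Print the five longest garbage collections, longest first.
        events.stream()
              .filter(e -> e.getEventType().getName().equals("jdk.GarbageCollection"))
              .sorted(Comparator.comparing(RecordedEvent::getDuration).reversed())
              .limit(5)
              .forEach(e -> System.out.println(
                      e.getStartTime() + "  " + e.getDuration().toMillis() + " ms"));
    }
}

This kind of script is handy for automated checks, such as failing a build when a recording contains a pause above some threshold.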
The second technique is a game-changer for understanding your own code. JFR captures JVM events, but your application is unique. What if you could see your business logic right there on the same timeline as the garbage collections and CPU spikes? You can, with custom events.
Let's say you have an order processing system. You want to know how long each order takes and if it succeeded. Instead of just logging, you can create a JFR event.
import jdk.jfr.*;

// This annotation defines our custom event.
@Name("com.mycompany.OrderProcess") // A unique identifier
@Label("Order Processing") // A human-readable name in JMC
@Category("Business") // Groups it in the JMC UI
@Description("Tracks the processing of a customer order")
class OrderProcessEvent extends Event {

    @Label("Order ID")
    String orderId;

    @Label("Processing Time (ms)")
    @Timespan(Timespan.MILLISECONDS) // Tells JMC how to display the value
    long processingTime;

    @Label("Success")
    boolean success;

    @Label("Payment Method")
    String paymentMethod;

    // A helper method to make recording the event easy and consistent.
    public static void record(String orderId, long startTimeNanos, boolean success, String paymentMethod) {
        OrderProcessEvent event = new OrderProcessEvent();
        event.orderId = orderId;
        event.processingTime = (System.nanoTime() - startTimeNanos) / 1_000_000; // Convert ns to ms
        event.success = success;
        event.paymentMethod = paymentMethod;
        // This final call writes the event to the active recording.
        event.commit();
    }
}
Now, inside your service code, using it is straightforward.
public class OrderService {

    // Order, logger, chargeCustomer, and updateInventory are assumed to be
    // defined elsewhere in the application.
    public void processOrder(Order order) {
        long startTime = System.nanoTime(); // High-precision start time
        boolean success = false;
        try {
            // ... complex business logic: validate, charge, update inventory, notify ...
            chargeCustomer(order);
            updateInventory(order);
            success = true;
        } catch (Exception e) {
            logger.error("Failed to process order " + order.getId(), e);
            success = false;
        } finally {
            // Record the event whether it succeeded or failed.
            OrderProcessEvent.record(order.getId(), startTime, success, order.getPaymentMethod());
        }
    }
}
After a recording session, I open Mission Control. I can see a list of all my Order Processing events. I can calculate the average time, find the outliers, and filter to see if orders with a certain payment method are slower. Most powerfully, I can see on the timeline: "Ah, this batch of slow orders happened right after a major garbage collection that paused all threads for 200 milliseconds." That connection is impossible to make with just logs.
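Because custom events are ordinary JFR events, the same consumer API can read them back, which is useful for turning a recording into a quick report. Here is a small sketch, assuming the recording was saved to a hypothetical file named orders.jfr:

import java.io.IOException;
import java.nio.file.Paths;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class OrderEventSummary {
    public static void main(String[] args) throws IOException {
        long count = 0, failures = 0, totalMillis = 0;
        for (RecordedEvent e : RecordingFile.readAllEvents(Paths.get("orders.jfr"))) {
            // Only look at our custom business event.
            if (!e.getEventType().getName().equals("com.mycompany.OrderProcess")) {
                continue;
            }
            count++;
            totalMillis += e.getLong("processingTime");
            if (!e.getBoolean("success")) {
                failures++;
            }
        }
        if (count > 0) {
            System.out.printf("Orders: %d, failures: %d, average time: %d ms%n",
                    count, failures, totalMillis / count);
        }
    }
}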
The third technique is hunting for memory leaks. A slow application is bad, but one that gradually consumes all memory until it crashes is a classic nightmare. JFR gives you a way to see the memory story.
The key is to look at allocation pressure and what survives. A healthy application creates short-lived objects. A leaking application creates objects that never go away, filling up the "old" generation of the heap. JFR can track allocations.
Here’s how you might configure a recording to focus on memory:
try (Recording recording = new Recording()) {
    // TLAB stands for Thread-Local Allocation Buffer. Most objects are born here,
    // and this event records a sample whenever a thread needs a fresh TLAB.
    recording.enable("jdk.ObjectAllocationInNewTLAB");
    // Larger objects are allocated outside the TLAB. We capture these too.
    recording.enable("jdk.ObjectAllocationOutsideTLAB");
    // On JDK 16 and later, jdk.ObjectAllocationSample is a cheaper alternative
    // that supports throttling, here to roughly 150 samples per second.
    recording.enable("jdk.ObjectAllocationSample").with("throttle", "150/s");
    // Capture a snapshot of heap usage every 5 seconds.
    recording.enable("jdk.GCHeapSummary").withPeriod(Duration.ofSeconds(5));

    recording.start();
    runMyApplicationWorkload();
    recording.stop();
}
In JDK Mission Control, the "Allocations" tab becomes your best friend. You can see which classes are being instantiated the most. You can see the "allocation stack trace"—the exact line of code that created them.
I once tracked a leak to a simple caching class. The graph in JMC showed the CacheEntry class count going steadily up and never coming down. The allocation stack trace pointed to a loadData method. The problem was that the cache had no size limit or expiration. The evidence was so clear that fixing it was the easy part. You look for trends: a line on a graph that only goes up, for a class that should be temporary.
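The fix itself was unremarkable, which is usually the case once the profiler has pointed at the right class. As a sketch of that kind of fix (the class and field names here are illustrative, not the original code), a size-bounded, least-recently-used map stops entries from accumulating forever:

import java.util.LinkedHashMap;
import java.util.Map;

// A simple size-bounded LRU cache.
class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true makes the map evict by least-recent access.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Drop the oldest entry once the cache grows past its limit.
        return size() > maxEntries;
    }
}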
You can also see the direct impact of memory on performance. The "Garbage Collection" page shows you the frequency and duration of GC pauses. If you see "Full GC" events that take several seconds, you know your application is stalling regularly to clean up a crowded heap.
The fourth technique is about finding CPU bottlenecks. Which method is using all the processor time? Traditional "instrumenting" profilers change the code to add timing, which itself slows things down and skews results. JFR uses statistical sampling, which is much lighter.
It works by periodically (say, every 10 milliseconds) looking at what every thread is doing and recording the stack trace. Over time, methods that appear in many samples are the "hot" methods consuming CPU.
Setting this up is simple:
try (Recording recording = new Recording()) {
    // Sample Java method execution every 10ms.
    recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
    // Also sample time spent in native methods (e.g., in libraries or the OS).
    recording.enable("jdk.NativeMethodSample").withPeriod(Duration.ofMillis(10));

    // Let it run for a while or with a size limit for production.
    recording.setMaxAge(Duration.ofMinutes(10));
    recording.setMaxSize(100 * 1024 * 1024); // 100 MB

    recording.start();
    // Your application runs normally here.
}
In Mission Control's "Method Profiling" view, you get a list of methods sorted by "Sample Count." The top method is your primary suspect. But the real power is in the "Call Tree." It shows you not just that methodA is hot, but that 95% of its time is spent inside methodB, which is doing a string operation in a tight loop.
I remember profiling a data transformation service. The hot method was a convert() function. The call tree showed it spent 70% of its time inside a parseDate() method. The date format pattern was being compiled from a string inside the loop for every single record. Moving the SimpleDateFormat object outside the loop was a one-line change that doubled the throughput. The profiler showed me exactly where to look.
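To make that concrete, here is a simplified reconstruction of the pattern (the class, method names, and format string are illustrative, not the original service code). The slow path builds a formatter for every record; the fixed path builds it once and reuses it:

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

class RecordConverter {

    // Fast path: compile the pattern once and reuse it.
    // (SimpleDateFormat is not thread-safe, so this assumes single-threaded use.)
    private final SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    // Slow path: a new SimpleDateFormat is compiled for every single record.
    Date parseDateSlow(String raw) throws ParseException {
        return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(raw);
    }

    Date parseDateFast(String raw) throws ParseException {
        return format.parse(raw);
    }

    void convert(List<String> rawDates) throws ParseException {
        for (String raw : rawDates) {
            Date parsed = parseDateFast(raw); // was parseDateSlow(raw)
            // ... transform the record using the parsed date ...
        }
    }
}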
The fifth and most powerful technique is running JFR continuously in production. This is like having a security camera on your application 24/7. When a user calls at 2 AM to report slowness, you have a recording of what happened.
You configure the JVM to always keep a rolling buffer of the last few hours of data.
# Launch your production application with a continuous recording.
java \
-XX:StartFlightRecording="name=Continuous,disk=true,maxsize=1G,maxage=4h,settings=profile" \
-jar my-production-app.jar
This command starts a recording that writes to disk, keeps a maximum of 1 GB of data, and only retains the last 4 hours. It uses the "profile" settings (a predefined set tuned for profiling). The overhead remains minimal.
But you can also interact with it programmatically. You can have your application dump a recording automatically when something bad happens.
// A shutdown hook to dump any active recording when the app shuts down.
// Assumes imports for jdk.jfr.FlightRecorder, jdk.jfr.Recording, jdk.jfr.RecordingState,
// java.io.IOException, and java.nio.file.Paths.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    System.out.println("Shutdown detected, dumping JFR recording...");
    for (Recording recording : FlightRecorder.getFlightRecorder().getRecordings()) {
        if (recording.getState() == RecordingState.RUNNING) {
            String fileName = "incident-dump-" + System.currentTimeMillis() + ".jfr";
            try {
                recording.dump(Paths.get(fileName));
                System.out.println("Saved incident recording to: " + fileName);
            } catch (IOException e) {
                System.err.println("Failed to dump recording: " + e.getMessage());
            }
        }
    }
}));
// You can also trigger a snapshot based on your own application metrics.
// For example, if a request takes too long:
class RequestMonitor {

    private static final Duration SLOW_THRESHOLD = Duration.ofSeconds(5);

    // Request, Response, and processRequest are assumed to be defined elsewhere.
    public Response handleRequest(Request req) {
        long start = System.nanoTime();
        try {
            return processRequest(req);
        } finally {
            Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
            if (elapsed.compareTo(SLOW_THRESHOLD) > 0) {
                System.out.println("Slow request detected, dumping a JFR snapshot.");
                // takeSnapshot() captures the data currently buffered by the
                // continuous recording; dumping it writes that data to disk.
                try (Recording snapshot = FlightRecorder.getFlightRecorder().takeSnapshot()) {
                    snapshot.dump(Paths.get("slow-request-" + System.currentTimeMillis() + ".jfr"));
                } catch (IOException e) {
                    System.err.println("Failed to dump snapshot: " + e.getMessage());
                }
            }
        }
    }
}
When a performance incident occurs, you connect to the production machine (with appropriate security), run jcmd <PID> JFR.dump name=Continuous filename=incident.jfr, and you get a file containing the minutes or hours leading up to the problem. You can analyze it in Mission Control on your laptop. You'll see thread deadlocks, network timeouts, sudden spikes in allocation, or a specific slow database query—all captured in context.
These five techniques changed how I work. I no longer fear performance issues. I have a plan.
- Start recording to get baseline data.
- Instrument key business flows with custom events to understand them.
- Use allocation profiling to keep memory healthy.
- Sample methods regularly to catch CPU regressions.
- Run it all continuously in production to capture the unexpected.
The code examples I've shared are starting points. You will adapt them. The goal is to build a clear picture of what your application is doing, using data instead of intuition. Start small. Add a 60-second recording to your next integration test. Create one custom event for your most important feature. Look at the results in JDK Mission Control.
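If you want a concrete starting point, a small helper like the sketch below (the class name and usage are hypothetical) is enough to wrap any test or workload in a recording you can open in JDK Mission Control afterwards:

import java.nio.file.Path;
import java.nio.file.Paths;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public final class Profiled {

    // Runs the given workload inside a JFR recording using the JDK's built-in
    // "profile" settings, then dumps the result to the given file.
    public static void run(String fileName, Runnable workload) throws Exception {
        Configuration profile = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(profile)) {
            recording.start();
            workload.run();
            recording.stop();
            Path out = Paths.get(fileName);
            recording.dump(out);
            System.out.println("Recording written to " + out.toAbsolutePath());
        }
    }
}

// Usage, for example inside an integration test:
// Profiled.run("checkout-test.jfr", () -> checkoutService.placeOrder(testOrder));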
You will be surprised by what you find. Often, the bottleneck is not where you think it is. But now, you have the tools to see it, measure it, fix it, and prove that your fix worked. That is the real power of moving from guessing to knowing.