
Darius Juodokas

Stack Trace / Thread Dump Analysis

Over my career as a Performance Engineer, I've had the pleasure of working with a number of mid-senior IT specialists (developers included). In many cases I had to explain how I could see a problem in the application's runtime when they couldn't - so as not to be dismissed as a person who is imagining things and wasting their precious time. And quite often my findings are based on StackTraces. Unfortunately, many software developers, when given a StackTrace, did not even know where to begin looking at it. Then it struck me: a StackTrace is such an informative data structure and so easy to obtain - it's a shame many specialists do not know how to deal with them.

This post is aimed at you, fellow developers and performance specialists, who want to familiarize yourselves with StackTrace analysis, or just to refresh your memory on how to deal with stacks.

What's what

Stack trace (ST, STrace)

A stack trace (ST) is a snapshot of what the process is currently doing. An ST does not reveal any data, secrets, or variables. An ST, however, reveals classes (in some cases) and methods/functions being called.

To better understand an ST consider a simple application:

function assertAuthorized() {
    user = db.findUser(request.getUsername(), hash(request.getPassword()));
    assert user != null;
}

function prepare() {
    assertVariables();
    assertAuthorized();
}

function doJob() {
    prepare();
    task1();
}

If we execute doJob() and capture an ST at the moment it reaches request.getUsername() (doJob() → prepare() → assertAuthorized() → request.getUsername()), we'll get an ST that looks similar to this:

Thread main
  at HttpRequest.getUsername
  at assertAuthorized
  at prepare
  at doJob

At the very top of the ST is a header, which, in this case, tells us the thread name (main). Under the header is the actual call trace, starting with the most recent call this process has made: getUsername. The getUsername() function has been called by assertAuthorized(), so that's the next line in the trace. assertAuthorized() has been called by prepare(), and prepare() by doJob() - so that's what we find as we move further down.

Reading the ST bottom-up we can see how "deep" the process currently is. In this particular example, the process has called doJob() (at doJob), inside doJob it called prepare() (at prepare), inside prepare it called assertAuthorized() (at assertAuthorized), and inside assertAuthorized it called getUsername() of the HttpRequest class. We don't see the db.findUser() call in the trace because it hasn't been made yet. In order to make that call, the process first has to obtain the username and the password hash and push them onto the stack.

A different way to visualize a ST would be horizontal rather than vertical. What is the thread currently doing?

doJob() → prepare() → assertAuthorized() → HttpRequest.getUsername()

It's exactly the same thing as before, just laid out differently.

N.B. process/thread stack variables are NOT captured in the stack trace.
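
Most runtimes let you print the current thread's stack trace at any point in the code. Here's a minimal Java illustration of the same idea (the class and method names are just placeholders mirroring the pseudocode above):

public class StackTraceDemo {

    static void doJob() {
        prepare();
    }

    static void prepare() {
        // Prints the current thread's call trace to stderr, e.g.:
        //   java.lang.Exception: Stack trace
        //       at java.base/java.lang.Thread.dumpStack(Thread.java)
        //       at StackTraceDemo.prepare(StackTraceDemo.java)
        //       at StackTraceDemo.doJob(StackTraceDemo.java)
        //       at StackTraceDemo.main(StackTraceDemo.java)
        Thread.dumpStack();
    }

    public static void main(String[] args) {
        doJob();
    }
}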

Some runtimes can embed more info in a stack trace, e.g. the line number in the source file where each call was made, the library (with or without its version) that the function is part of, PID/TID/NTID (ProcessID/ThreadID/NativeThreadID), resource utilization (thread execution time, amount of CPU% consumed by that thread), synchronization primitives engaged (monitors), etc. These additions vary from runtime to runtime and from version to version, and should not be assumed to always be available.

Thread dump (TD, TDump)

Oftentimes Stack Trace and Thread Dump are used interchangeably. While there is a slight difference, STs and TDs are very much alike - almost the same thing, even.

A stack trace reflects the sequence of calls a single process or thread is currently making - that's the trace part. These calls (return addresses, or instruction pointers, in low-level terms) are stored in a memory structure of the process called the stack - that's the stack part. Sometimes StackTraces are also called CallTraces, although a CallTrace is a more generic term and does not focus on threads/processes/stacks explicitly.

A ThreadDump is a bundle of STs - one for each thread the application is running - probably with some additional information, like the timestamp of capture, perhaps some statistics and other related diagnostic info. And that's pretty much the whole difference. Capture a bunch of STs - one per each thread the application is running - and jam all of them into a single file. There, you have yourself a ThreadDump!
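
In JVM applications (more on those below) you can even assemble a rudimentary ThreadDump from inside the process itself. A minimal sketch using the standard Thread API - the output formatting here is my own, not the jstack format:

import java.util.Map;

public class MiniThreadDump {
    public static void main(String[] args) {
        // One stack trace per live thread -- essentially a bare-bones ThreadDump
        Map<Thread, StackTraceElement[]> allTraces = Thread.getAllStackTraces();
        allTraces.forEach((thread, frames) -> {
            System.out.println("Thread " + thread.getName() + " (" + thread.getState() + ")");
            for (StackTraceElement frame : frames) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        });
    }
}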

Levels of stack traces

All processes have a stack trace. Perhaps with the sole exception of processes in the Z state (i.e. Zombie processes), every process is executing some function at any given time. As soon as the last remaining function (the first one the process ran when it was started) returns, the process terminates. However, in some runtimes, a process ST might not be the call trace we actually need. Consider Java applications. They all run on a JVM - a virtual machine of sorts that parses and translates the code into native binary instructions a processor is capable of understanding. If we find a JVM process's PID and peek at its call trace, we'll see some gibberish, and not a single line will resemble anything Java. That's because JVM processes are binary - they are NOT Java. These processes are the actual JVM processes - the ones that carry the JVM on their shoulders. And our application is running inside of them. So in order to see our Java threads' call traces, we don't capture process StackTraces. Instead, we use the JDK's tools, like jstack, which connect to the JVM and capture your application threads' call traces from inside those native processes - from inside the Java Virtual Machine.

Pretty much the same thing applies to any interpreted or pseudo-interpreted language. The native processes are running the mechanism that runs your application, so if you capture their STs, you won't see what your application is doing; instead, you will see what your runtime engine is doing. To see what's happening inside it and what your application threads' call traces are, use tools provided by your language's runtime engine. You can get stack traces even for bash (see: bthrow)!

What's inside an ST/TD

  • a strict sequence of method/function calls a process/thread has made
  • the oldest call is at the bottom
  • the newest (most recent) call is at the top

What is NOT inside an ST/TD

  • operations
    • while, for, switch, if, other built-into-syntax operations
    • arithmetic operations (+ - * / ^ etc.)
    • assignment operations (=)
    • comparison operations (== > < >= <= !=)
  • data
    • variables
    • constants
    • objects/instances
    • function/method parameter values

When

When troubleshooting applications and their performance, it's a very healthy habit to think about an application from two perspectives: nouns and verbs.

  • nouns - this is the data of the application: variables, buffers, strings, bytes, arrays, objects, heap, memory, memory leaks, RAM -- all these keywords refer to DATA. You care about data when you know you have something wrong with the data: either too much of it (e.g. memory leaks), or the data is incorrect (integrity errors). For these cases, you need either a coredump (for native processes) or a heapdump (e.g. for JVM). Data has nothing to do with threads or call/stack traces.
  • verbs - these are the actions of the application: functions, methods, procedures, routines, lambdas. These are loaded into the stacks of processes/threads in a very strict sequence, that allows us to trace down the flow of execution.

When it comes to performance, we often see an application misbehave:

  • functions hanging
  • threads appear not responding
  • processes/threads burning CPU at 100%
  • threads locking each other
  • slow responses
  • huge amounts of files/sockets open
  • huge amounts of file/socket IO operations
  • ...

In all those cases we need either an ST or a TD. Only when we have a snapshot of all the application's threads/processes and their actions (verbs) are we able to tell what is happening inside it and what the application is doing. At the top of each process's/thread's call trace we will see the last function/method call it has made (or is making, as that would be more technically correct). Going down the stack we can see where that call originated from. This way we can trace down WHAT our precious application's processes/threads are doing, and WHY.

How to capture

A Stack Trace / Thread Dump is a snapshot of the application's activity. Normally we don't need to keep an eye on STs. Only when an application is misbehaving do we care to inspect its execution. That being said, capturing STs is only useful at the time of misbehaviour. Not only that, but to have a useful StackTrace we must capture it at the very moment the application is having issues. Not before and definitely not after the application problem goes away. It's like doing an x-ray: we need it done while we are ill, not before and not after the illness. Sometimes it's really hard to capture STs at the right moment, because some issues last only for a second, or even less. ST capturing can be automated and integrated into monitoring/alerting systems as a reactive-diagnostic measure to increase the chances of capturing the right data, or we could be capturing a series of STs every second (or less; or more - depending on the issue), hoping to have at least one ST that reflects the problem.

Capturing a series of STs has another benefit. If these STs are captured close enough to each other (seconds/milliseconds apart), we can treat the bundle as a profile. Just as it takes ~25 frames per second on a screen to trick our eyes (mind) into seeing live motion, we can take several STs per second and treat them as the "live application's motion", i.e. in a way we can track the application's actions over time. The accuracy of such tracking gets worse the further apart these snapshots are. More on that later.

Native (process)

Linux

Getting a StackTrace of a process in Linux is quite easy. However, there's something worth noting: you can access 2 stack traces for any Linux process (though on some distros it's not possible): one in the kernelspace and another in the userspace. That's because each process runs in both regions: it does all the calculations and decisions in the userspace, but when it comes to hardware-related things (IO, sleeps, memory, etc.), the process has to ask the Operating System to do the job for it. And the OS does its part in the kernelspace. To make matters simpler, each process gets a share of memory assigned for its stack in both areas, so it can freely execute user code and kernel code.

Normally, when a process asks the kernel to do something, it becomes unresponsive to the userspace (feels frozen). So if your process is not responding -- have a look at its kernel stack. Otherwise - see its stack in the userspace.

To see a process stack you will need that process's PID.

  1. find a PID of the process you're interested in
    • ps -ef | grep "<search string>" -- PID will be in the 2nd column
    • ps -efL | grep "<search string>" -- PID will be in the 2nd column; the -L argument makes ps list not only processes but also threads
    • pgrep "<search string>" -- PID will be the only value in the output

Now the procedure to access the kernel stack and user stack is a little different.

  1. For a kernelspace stack, run sudo cat /proc/${PID}/stack and you will get a nice stack trace. A sleep 9999 stack looks like this:
[<0>] hrtimer_nanosleep+0x99/0x120
[<0>] common_nsleep+0x44/0x50
[<0>] __x64_sys_clock_nanosleep+0xc6/0x130
[<0>] do_syscall_64+0x59/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

(notice the nsleep and nanosleep calls - it's the OS making the actual sleep call)

and a process currently running in userspace will not be executing anything in the kernelspace, so its kernel stack will look similar to this:

[<0>] ep_poll+0x2aa/0x370
[<0>] do_epoll_wait+0xb2/0xd0
[<0>] __x64_sys_epoll_wait+0x60/0x100
[<0>] do_syscall_64+0x59/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

(notice the wait and poll calls - in the kernelspace this process is simply waiting for something to do).

  2. For a userspace stack, you have 2 basic options:
    • use the gdb debugger to attach to a running process (gdb -p ${pid} and run bt inside)
    • install tools that can automate that for you, like pstack or eu-stack (comes with the elfutils package)

pstack is usually preinstalled and it's very straightforward to use -- just give it a PID and see the stack appear in your console. However, I prefer eu-stack, because it has more options, like thread printing, verbosity and other nice features (see eu-stack --help).

~ $ eu-stack -p 715821
PID 715821 - process
TID 715821:
#0  0x00007f75b5c991b4 clock_nanosleep@@GLIBC_2.17
#1  0x00007f75b5c9eec7 __nanosleep
#2  0x0000555a72b09827
#3  0x0000555a72b09600
#4  0x0000555a72b067b0
#5  0x00007f75b5be0083 __libc_start_main
#6  0x0000555a72b0687e

or with all the fancy options on:

~ $ eu-stack -aismdvp 715821
PID 715821 - process
TID 715821:
#0  0x00007f75b5c991b4     __GI___clock_nanosleep - /usr/lib/x86_64-linux-gnu/libc-2.31.so
    ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78:5
#1  0x00007f75b5c9eec7 - 1 __GI___nanosleep - /usr/lib/x86_64-linux-gnu/libc-2.31.so
    /build/glibc-SzIz7B/glibc-2.31/posix/nanosleep.c:27:13
#2  0x0000555a72b09827 - 1 - /usr/bin/sleep
#3  0x0000555a72b09600 - 1 - /usr/bin/sleep
#4  0x0000555a72b067b0 - 1 - /usr/bin/sleep
#5  0x00007f75b5be0083 - 1 __libc_start_main - /usr/lib/x86_64-linux-gnu/libc-2.31.so
    ../csu/libc-start.c:308:16
#6  0x0000555a72b0687e - 1 - /usr/bin/sleep

The nice thing about userspace STs is that you don't need sudo to get them if the process is running under your user.

Windows

sigh...
...

Probably the easiest approach is through the magnificent ProcessExplorer, described here and in lots of other places across the internet.

  • download and install ProcessExplorer
  • launch ProcessExplorer
  • find the PID you want an ST of
  • right-click → properties
  • select the Threads tab
  • select the thread you'd like to inspect
  • click on the Stack button

The stack frames will look cryptic, but if that's good enough for you -- there you go! If you want a clearer ST, you'll have to download and install the Debugging Tools for Windows and load Debug Symbols into ProcessExplorer as shown in tutorials:

  • Options → Configure Symbols
  • C:\Windows\system32\dbghelp.dll as the DbgHelp DLL path (or whatever it is on your system)
  • SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols as the Symbols Path (or whatever it is in your system)

JVM

Once again, in the case of JVM, there are at least 2 levels of call traces:

  • native
  • application

To get a native process ST please look above, as it's already been discussed.

To get an Application StackTrace, the JVM provides a way to capture a ThreadDump (TD). There are a few ways to capture TDs of a running application (plus a programmatic option, sketched after the list):

  • jstack -l <pid> - requires the JDK utility jstack to be available on the same machine the JVM is running on. This is the preferred approach in almost all cases. It prints the TD to jstack's own output
  • jcmd <pid> Thread.print - just like the above: requires a JDK utility to be present and prints the TD to jcmd's output. Since it's more verbose and a bit more cumbersome to remember, it's not a widely used technique
  • kill -3 <pid> or kill -s QUIT <pid> (both are identical) - does not require any additional software, as kill usually comes with the Linux OS. Sending a SIGQUIT signal to the JVM causes it to print a TD to the JVM's own output. And that's the main inconvenience compared to the jstack and jcmd approaches -- you don't get the TD here and now; you have to find the JVM's output (could be either a log file or a console) and extract the TD from there. It can get very messy... But sometimes that's the only option you have.
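
If none of the above is available (no JDK utilities on the box, restricted shell), a fourth, programmatic option is to have the application dump its own threads via the JMX ThreadMXBean - similar in spirit to the Thread.getAllStackTraces() sketch earlier, but it also reports lock information. This is a rough sketch, not a drop-in replacement for jstack; in practice you'd expose it through an admin endpoint or write it to a log:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class SelfThreadDump {
    public static void main(String[] args) {
        ThreadMXBean threadMx = ManagementFactory.getThreadMXBean();
        // true, true => also report owned monitors and ownable synchronizers,
        // similar to the lock information `jstack -l` adds to a dump
        for (ThreadInfo info : threadMx.dumpAllThreads(true, true)) {
            System.out.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
            System.out.println();
        }
    }
}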

Capturing a TD in a JVM stops the JVM for a split second and resumes it immediately after. In some cases, the JVM might appear stuck and jstack may recommend using the -F option. As always, handle the force with great care, as it might lock up all your application threads for good (ask me how I know).

Other runtimes

Some runtimes have their own alternatives to the JVM's jstack; others either don't, or support plugins for GNU's gdb to acquire TDs through a debugger. Refer to your runtime's documentation to see if it has any means to capture a call trace. Asking Google for a second opinion doesn't hurt either.

Analysis

CallTrace analysis can provide us with a lot of information about what's going on with the application:

  • how many threads are doing X at the moment
  • how many threads are there
  • which actions are "hanging" and at which function calls
  • which threads are slacking
  • which threads are stuck in infinite loops and which functions have caused them
  • which threads are likely to be causing high CPU usage; what are they doing to cause that
  • which threads are making outgoing long-lasting IO calls; where do these calls originate from in code
  • which threads are blocking other threads (using synchronization primitives) and at which function calls (both blocked and blocking)
  • which/how many threads are running in X part of your code (e.g. PaymentService2.transfer()), should you care to know if a particular method/function is being called
  • which threads are likely to blow the Stack up (cause stack overflows); and through which function calls
  • how large are your thread pools and whether they are over/undersized
  • other useful info, depending on what's the problem

Individual STs

Analysing individual plain-text StackTraces might seem challenging at first, but a few stacks later the pattern becomes easier to understand. However, in the case of ThreadDumps, especially with lots of running threads (hundreds to thousands), manual analysis is still a difficult task to take on.

Find the right threads

The approach to manual ST analysis depends on what issue we are trying to tackle. Usually, we're focusing on busy threads - the ones with long call traces. However, "long" is a relative term. In some applications "long" is 200 stack frames, while in others it could be as low as 30. Allow me to rephrase that:

  • Idle threads [short call traces] - threads have started up and initialized and now are waiting for tasks to work on; in standard Java web apps I like to take 30 frames as a point of reference and look for threads with call traces ~30 frames long; could be less, could be more. These threads are idle, regardless of their Thread state.
  • Busy threads [long call traces] - threads have received a task to work on and are currently processing it. While processing a task, their call traces will usually grow significantly larger than idle - could easily be hundreds of frames. These threads are busy, regardless of their Thread state.

Usually, we need to find those busy threads first.

There are no busy threads

It may happen that the application is running slow or "hanging" and there are no busy threads. The most likely reason is a bottleneck somewhere outside the application process. I mean, if the application is not doing anything (idling), then it's not slow -- no one is giving it any work to do. Was the request lost or discarded in transit, even before reaching your application process? Is it getting stuck in transit?

The second most likely reason is that the TD was captured at the wrong time: either before or after the slowness took place. This happens more often than I'd like to admit. It will almost always happen if you are not the one capturing the trace and you are working through someone else's hands ("Ops team, please capture the TD"; 5 minutes later: "here it is"; the issue went away 4 minutes ago), especially if that someone does not know how/when to capture it to obtain useful diagnostic information.

The third reason could be that the ST was captured from the wrong process. This is especially the case when requests are hitting a LoadBalancer of any kind (a DNS server is also a primitive LB) and there is no way to determine which server the request will land on. Often Ops take an ST/TD from a random server/instance that belongs to the application pool and assume that's what you've asked for.

And the fourth reason that I've come across is that you've got the idle/busy threshold wrong. In some applications, processing a task could add very few stack frames - as few as 1 to 5. Look at the traces you have, compare them, and see what's at the top of them. Perhaps you've miscalculated what a "long" stack trace is in your particular case.

There are some busy threads

Great, now that you've managed to find the busy threads, it's time to see what's in there. Probably the quickest way to see what's happening in a thread is by reading it top-to-bottom - or, even better, just taking a look at the top 5 or so frames in the call trace. Start from the top. See what the most recent function call was - what's its name? This might hint at what the thread was doing at the very moment this ST was captured. If the function call doesn't make much sense -- look at the frame below. If still nothing - look below that. Keep going downwards until you reach a frame you can make some sense of. If that frame is a good enough explanation of what the thread is doing -- hey presto! If not -- either go back up, trying to make sense of what the method you've recognised is doing internally, or keep going down, trying to find which other function has called it.

Once you know what that thread is doing - go to another. Don't worry, it will only take this long for the first few call traces. You'll get up to speed relatively quickly once you get the idea of what to look for and where.

See how many threads are busy doing more or less the same thing. If there is a handful of threads with very similar stack traces - you're probably on to your bottleneck. The higher in the stack you are - the more accurately you can define the nature of the bottleneck. It's best to be able to understand the top 5 frames in the stack.

Examples

"Camel (camel-8) thread #140 - aws-sqs://prod-q" #235 daemon prio=5 os_prio=0 tid=0x00007f5e40f26000 nid=0x113 runnable [0x00007f5c00dcb000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
        at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
        at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1346)
        at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
        at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:962)
...

java.net.SocketInputStream.socketRead0() -- means the application itself is waiting for a response over the network from a remote service it depends on. Probably the downstream service is lagging.


"threadPoolTaskExecutor-1" #114 prio=5 os_prio=0 cpu=4226.74ms elapsed=356.38s tid=0x00007f1b5df8cee0 nid=0xd3 waiting on condition  [0x00007f1b568a9000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
        - parking to wait for  <0x00001000dbd3afd0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.16/AbstractQueuedSynchronizer.java:2081)
        at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:393)
        at org.apache.http.pool.AbstractConnPool.access$300(AbstractConnPool.java:70)
        at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:253)
        - locked <0x00001000db478530> (a org.apache.http.pool.AbstractConnPool$2)
        at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:198)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:306)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:282)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)

...

org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(), which soon leads to an org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking() call and parks the thread -- means you have an HttpClient connection pool that is too small to handle the load effectively. You may need to increase that pool size and retest (see the sketch below).
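
As a rough illustration of the fix - assuming Apache HttpClient 4.x, which is what the org.apache.http.* frames above point to; the numbers are made up and must be sized for your actual load:

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledHttpClientFactory {
    public static CloseableHttpClient create() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);           // total connections across all routes (default is only 20)
        cm.setDefaultMaxPerRoute(50);  // connections per downstream host (default is only 2)
        return HttpClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}

If a framework builds the HttpClient for you, the knob will be named differently, but the idea is the same: the pool should be at least as large as the number of threads that may need a connection at the same time.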


"threadPoolTaskExecutor-5" #73 prio=5 os_prio=0 cpu=1190.99ms elapsed=5982.61s tid=0x00007f103db6e940 nid=0xb3 waiting on condition  [0x00007f103ac64000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
        - parking to wait for  <0x0000100073c178f0> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:194)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.16/CompletableFuture.java:1796)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.16/ForkJoinPool.java:3128)
        at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.16/CompletableFuture.java:1823)
        at java.util.concurrent.CompletableFuture.join(java.base@11.0.16/CompletableFuture.java:2043)
...

java.util.concurrent.locks.LockSupport.park(), when you see some CompletableFuture.get()/join() calls lying around - you are probably using the common ForkJoin thread pool for your CompletableFutures, which by default is limited to a very small number of threads (CPU count - 1). Either rewrite the code to use your own custom Executor or increase the size of the common pool (both options are sketched below).
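
A minimal sketch of the first option - handing your CompletableFutures a dedicated executor instead of the common ForkJoin pool (callRemoteService() is a hypothetical blocking call, and the pool size is illustrative):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCalls {
    // A dedicated pool sized for blocking IO work, instead of the shared
    // ForkJoinPool.commonPool(), which defaults to (CPU count - 1) threads
    private static final ExecutorService IO_POOL = Executors.newFixedThreadPool(64);

    public static CompletableFuture<String> fetchAsync() {
        return CompletableFuture.supplyAsync(AsyncCalls::callRemoteService, IO_POOL);
    }

    private static String callRemoteService() {
        return "response"; // hypothetical blocking downstream call
    }
}

The second option - growing the common pool - is a JVM flag: -Djava.util.concurrent.ForkJoinPool.common.parallelism=32 (the property name is standard; the value here is illustrative).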


Flame graphs

What they are

While manual analysis of plain-text STs is tedious, difficult and requires a lot of effort, the skill of reading an ST is well worth gaining, because it unlocks more robust ST analysis techniques. My all-time favourite is FlameGraphs. At first look, they are funny-looking pictures, kind of diagrams, but very weird ones. Kind of like skyscrapers with lots of floors. And the colours don't make them look any prettier.

I agree, flamegraphs aren't pretty pictures. Because that's not what they are. Instead, they are a very powerful, interactive, multi-dimensional graphical representation of related tree data, like CallTraces. You don't have to like them to understand them. I like to call them "multiple call traces laid over each other". As if they were transparent sheets of plastic with some similar curves and shapes on them, and you could pile them up and look through the pile and see how different or similar these curves and shapes are.

If all the call traces were identical across all the STs captured, in a flamegraph you'd see a single column with identical stack frames. Simple, because they are all the same, and if you lay them all on top of each other -- they will look the same. No differences.

If all the traces start with the same 5 calls and only the subsequent calls diverge into 2 different sets of function call sequences across all the STs -- in a FlameGraph we'll have a single column 5 frames high, on top of which sit 2 narrower columns, representing the 2 different sequences of function calls.

Now, since STs in busy applications usually have a whole lot of different threads with different stack traces, their flamegraphs will look like a city with skyscrapers.

Dimensions

Each FlameGraph has several dimensions:

  • height - that's how deep in the stack the frame is. The higher the frame -- the deeper (the more recent) the call is. Just like in raw StackTraces.
  • width - how many identical sequences of stack frames there are, compared to other stack frames
    • if the bar is as wide as the picture -- all these calls have been seen in all the threads
    • if the bar is 50% of the picture width -- 50% of threads had this call in their STs at this same position (with the same sequence of preceding calls); the remaining 50% had different calls at that depth of the stack
  • colours - that's optional, though a handy dimension: it highlights calls made by functions written by us, and dims down function calls inside libraries. It's just easier to filter out the calls we're primarily interested in from the ones that libraries are doing internally.
  • text - describes what was the function call (basically an excerpt from the CallTrace)
  • other - optionally, there might be on-mouseover pop-ups providing even more info on each frame

Each frame in the flamegraph is clickable. By clicking on a frame you can expand/zoom in its tree of sub-calls (i.e. make the clicked frame 100% wide and widen all the upper frames respectively). By clicking on frames below the widened one you can zoom out.

I know, it's a lot to take in. Developers I tried to show flamegraphs to were confused and reluctant to learn to use them, thinking it's an overly complex tool requiring a lot of knowledge/skill/time to get any use out of. IT'S REALLY NOT!!! If you can read CallTraces, reading FlameGraphs will be just as easy - if you bother to play with them for 10 minutes the first time. You shouldn't need more than 10 minutes to understand the concept.

Why bother?

Aaahh, a fine question. And it IS worth the bother, because it reduces the amount of data you have to analyze from 70MB of plain-text TDs down to a single image that fits on your screen. Moreover, a FlameGraph has most of the analysis already done for you - and visualised!

In general, when I'm analysing FlameGraphs, I'm looking for the largest clusters of identical CallTraces. They will be depicted as skyscrapers with the flattest rooftops. The higher the skyscraper and the flatter the rooftop -- the more likely that's a bottleneck. But how are buildings and CallTraces related..?

  • the higher the column - the deeper the thread is in its task execution. Remember, when analysing raw STs we are focusing on busy threads -- the ones with large call traces.
  • the more threads have that same call trace (up to some particular frame), the wider the tower/column will look

Example

Consider this Clojure example:
Clojure example

Let me break it down for you. We're looking at separate "towers". Here we can see 4 significantly different "towers".
Flamegraph breakdown into 4 towers

  1. Lots of sharp spikes on this one, meaning several threads are working on this kind of task (the base of the tower is somewhat wide), but while processing those tasks the threads are doing all kinds of things (different spikes). There's no one thing multiple threads would be stuck on (the same frame across multiple threads would be represented as a wide cell: the more threads - the wider the cell). I don't see any problems there.
  2. The roof of this tower is so wide that it's hard to even call it a tower. More like a plain (geographical)... The stack depth is definitely more than 30 frames. Are these busy or idle threads? If idle - we don't care much for them, but if busy - ... that's worth looking into. It's hard to imagine an idle thread that would be waiting for new tasks while calling Object.hashCode(), so these must be busy threads. Why are soooo many threads (the column is very wide) stuck calling Object.hashCode()? This definitely looks like a severe performance problem, as normally hash calculation is too quick to even appear in StackTraces. And now we have not one, not two, but over 50% of all the threads stuck there! Does this hashCode function have some custom implementation? Need to look into it.
  3. This tower is the narrowest of the 4, and the rooftop has lots of antennae and no flat surface for a helicopter to land on. That looks like healthy execution - I don't see any problems there!
  4. That's a relatively wide tower, with a wide base. But look at the top - it has a few spikes and a big flat surface at the very top! This flat rooftop suggests that several threads are stuck at an identical function call and cannot proceed further. What's written on that rooftop...? I can't see well. Zoom in on that tower by clicking on one of the widest frames in it (see the Zoomed-in tower 4 image). There, all better now. Looks like several threads are stuck calling Object.clone(). It could be normal in some cases, because clone() might have to traverse a large graph of objects (if it's a deep clone), or it might be another performance issue. Might be worth looking into, right after the elephant named Object.hashCode() leaves the room.

Reverse-roots Flame graph

Once you get a good grip on FlameGraphs, I'd like to introduce Reverse-roots Flamegraph (RrFG) to you. Allow me to do that through an example.

Consider the FlameGraph below. The application is sluggish and takes its time to respond to some requests, while it deals with other requests very quickly. The common denominator between the slow requests is that the application is making downstream network calls in order to process them. So we suspect network-related issues. Now, in this FlameGraph we see several high columns, some with flat rooftops, others with spiky ones. How do we know which ones are making network calls? Well, we hover the mouse over each and every rooftop and see what the popup says - whether it's a socketRead0() or not. And this is a tedious and error-prone job. See the FlameGraph below -- the circled rooftops are socketRead0() calls. Can you tell how many threads are making that call? I certainly can't, without a notepad, a pencil, a calculator and half an hour to spare. Wouldn't it be nice to have all the similar "roofs" aggregated in one place?

Normal FlameGraph

This is where the RrFG comes in. And it does exactly that. It reverses all the StackTraces and draws the FlameGraph as before. However, before reversing, most of the STs started (at the bottom) with Thread.run() and ended with Object.wait(). After reversing, these STs now start with Object.wait() and end with Thread.run(). Not only does this manoeuvre change the order of reading a stack, it also sums up all the Object.wait() calls into a single cell at the very bottom. Similarly, it aggregates all the socketRead0() calls as well. Now we can see the total number of socketRead0() calls, how they compare to other calls across the threads, and we know where these calls originated from (by reading up through the column).

RrFG

Series of STs. Profiling

Analysing a StackTrace gives you plenty of useful information to work with, but that all assumes you have the right ST. In some cases it might be difficult to obtain an ST reflecting the problem in question, especially when it lasts for only a second or less, and even more so when it's not easy to reproduce or its occurrences are unpredictable. So instead of fishing for sprats with a harpoon, we can cast a net and hope we catch at least one of them. We can do so by capturing StackTraces of the same process every 500ms (or even more frequently) and appending them to a file. This increases the chance that we'll catch the application at the time it's struggling and capture the informative ST we need for analysis.

Suppose you're a micromanager. You have a developer under you and you want to know what he's doing. The easiest approach is to come by his desk every 2 minutes and see for yourself.

  • 09:00-09:28 he's absent
  • 09:30-12:44 he's at his desk
  • 12:46-13:18 he's absent again
  • 13:20-15:52 he's at his desk
  • 15:54-15:58 he's absent
  • 16:00-18:00 he's at his desk

Seeing the results, we can speculate that the developer was 28 minutes late in the morning. Then he was working uninterrupted until 12:46, when he probably took a lunch break. At 13:20 he came back and worked continuously to the EOD, with a single short break at 15:54 - probably a bathroom break. What happened between each 2-minute poll - we don't know; maybe he was working, maybe running around naked - there's no way to know. But from the sequence of data we have, it seems like he's a pretty good and hard-working employee.

We can do the same thing with a sequence of StackTraces. If we poll the application's stack often enough, we may get a good enough idea of what it's doing over time: where it is fast, where it is slow, and where it idles. The closer the polls are, the better the granularity of the application's workings we get, and the more accurate a profile we can create. Of course, polling for 8 hours every 500ms or less might be overkill, not to mention the amount of data created. It's a good idea to only profile an application during the period it's having issues. As for the number of STs to analyze -- either squash all those STs into a single FlameGraph and see which functions the application was spending most of its time in, or create a FlameGraph for each capture of the STs and compare them with each other. Or, you know, group STs by seconds (or 5 sec, or so), generate FlameGraphs out of each of the groups and compare them visually to each other. Or do all of the above.

Pitfalls

While StackTraces are tremendously useful diagnostic information, we should be careful both capturing and analysing them.

Capturing

It's very important WHEN the StackTrace is captured. In order to be a useful piece of diagnostic data, it must be captured at the right time -- right when the application is experiencing issues. Even 1 second too late or too soon can make a world of difference and render the ST unfit for analysis.

Analysing

Having a tool as powerful as a FlameGraph, you might be tempted to start using phrases like "this method took longer to execute than <...>". There's no "longer" in StackTraces, just as there's no "quicker". There is no time duration in StackTraces, just like there's no timespan in a 20-year-old picture of you as a kid. Merely by looking at a picture I would not be able to tell how long you were standing in that pose. And this is especially the case when squashing a ThreadDump with lots of running threads into a FlameGraph. It didn't take Object.hashCode() 50% longer to execute than any other method in the application. The right way to say it would be: "50% of busy threads were seen calling Object.hashCode()". And knowing the swift nature of hashCode(), we may speculate that it's not normal for this method to appear in STs unless it takes longer to execute than expected.

We may have a sense of time if we capture a series of STs and squash them into a FlameGraph, because then we do have a timespan - these STs have been captured over time. But be careful, as the dimension of time is mixed with the dimension of threads-per-TD. Each snapshot you take may have a number of threads that are executing some function calls. And as tasks are executed, they tend to be picked up by different threads. And no one says an execution can't last exactly 500ms -- just long enough for you not to notice any difference between ST polls. And if you don't notice either thread swaps or execution durations (which you probably won't, as you don't know what happens between ST polls), you'll be tempted to assume a single continuous execution of some duration.

Be careful. StackTraces have nothing to do with the duration of execution. They might suggest it, assuming you capture a sequence of STs with fine enough granularity, but there is always a chance you missed an interruption or a change of execution between the polls.

