Finding infinite loops in Java programs that only occur on production systems with StackTraceSealer

Motivation

Assume you have a nasty bug somewhere in your code which results in thread spinning and eating 100% of your precious CPU every now and then. Of course, the behaviour can never be reproduced in a test environment, only on production. Luckily, there is JMX and JConsole, so you can obtain a thread dump from production to see where the problematic thread is spending its time.

Unfortunately, when looking at the thread dump, it may look like this (method names anonymized):

	 at java.lang.Thread.yield(Native Method)
	 at net.sf.stsealer.examples.LongExample.call50(LongExample.java:65)
	 at net.sf.stsealer.examples.LongExample.call49(LongExample.java:62)
	 at net.sf.stsealer.examples.LongExample.call48(LongExample.java:61)
	 at net.sf.stsealer.examples.LongExample.call47(LongExample.java:60)
	 at net.sf.stsealer.examples.LongExample.call46(LongExample.java:59)
	 at net.sf.stsealer.examples.LongExample.call45(LongExample.java:58)
	 at net.sf.stsealer.examples.LongExample.call44(LongExample.java:57)
	 at net.sf.stsealer.examples.LongExample.call43(LongExample.java:56)
	 at net.sf.stsealer.examples.LongExample.call42(LongExample.java:55)
	 at net.sf.stsealer.examples.LongExample.call41(LongExample.java:54)
	 at net.sf.stsealer.examples.LongExample.call40(LongExample.java:53)
	 at net.sf.stsealer.examples.LongExample.call39(LongExample.java:52)
	 at net.sf.stsealer.examples.LongExample.call38(LongExample.java:51)
	 at net.sf.stsealer.examples.LongExample.call37(LongExample.java:50)
	 at net.sf.stsealer.examples.LongExample.call36(LongExample.java:49)
	 at net.sf.stsealer.examples.LongExample.call35(LongExample.java:48)
	 at net.sf.stsealer.examples.LongExample.call34(LongExample.java:47)
	 at net.sf.stsealer.examples.LongExample.call33(LongExample.java:46)
	 at net.sf.stsealer.examples.LongExample.call32(LongExample.java:45)
	 at net.sf.stsealer.examples.LongExample.call31(LongExample.java:44)
	 at net.sf.stsealer.examples.LongExample.call30(LongExample.java:43)
	 at net.sf.stsealer.examples.LongExample.call29(LongExample.java:42)
	 at net.sf.stsealer.examples.LongExample.call28(LongExample.java:41)
	 at net.sf.stsealer.examples.LongExample.call27(LongExample.java:40)
	 at net.sf.stsealer.examples.LongExample.call26(LongExample.java:39)
	 at net.sf.stsealer.examples.LongExample.call25(LongExample.java:38)
	 at net.sf.stsealer.examples.LongExample.call24(LongExample.java:37)
	 at net.sf.stsealer.examples.LongExample.call23(LongExample.java:36)
	 at net.sf.stsealer.examples.LongExample.call22(LongExample.java:35)
	 at net.sf.stsealer.examples.LongExample.call21(LongExample.java:34)
	 at net.sf.stsealer.examples.LongExample.call20(LongExample.java:33)
	 at net.sf.stsealer.examples.LongExample.call19(LongExample.java:32)
	 at net.sf.stsealer.examples.LongExample.call18(LongExample.java:31)
	 at net.sf.stsealer.examples.LongExample.call17(LongExample.java:30)
	 at net.sf.stsealer.examples.LongExample.call16(LongExample.java:29)
	 at net.sf.stsealer.examples.LongExample.call15(LongExample.java:28)
	 at net.sf.stsealer.examples.LongExample.call14(LongExample.java:27)
	 at net.sf.stsealer.examples.LongExample.call13(LongExample.java:26)
	 at net.sf.stsealer.examples.LongExample.call12(LongExample.java:25)
	 at net.sf.stsealer.examples.LongExample.call11(LongExample.java:24)
	 at net.sf.stsealer.examples.LongExample.call10(LongExample.java:23)
	 at net.sf.stsealer.examples.LongExample.call09(LongExample.java:22)
	 at net.sf.stsealer.examples.LongExample.call08(LongExample.java:21)
	 at net.sf.stsealer.examples.LongExample.call07(LongExample.java:20)
	 at net.sf.stsealer.examples.LongExample.call06(LongExample.java:19)
	 at net.sf.stsealer.examples.LongExample.call05(LongExample.java:18)
	 at net.sf.stsealer.examples.LongExample.call04(LongExample.java:17)
	 at net.sf.stsealer.examples.LongExample.call03(LongExample.java:16)
	 at net.sf.stsealer.examples.LongExample.call02(LongExample.java:15)
	 at net.sf.stsealer.examples.LongExample.call01(LongExample.java:14)
	 at net.sf.stsealer.examples.LongExample.main(LongExample.java:10)
	 at net.sf.stsealer.examples.Main.main(Main.java:10)

And if you keep creating new thread dumps, the stack trace always looks the same; so the hunt begins. While it is pretty obvious there is no infinite loop in Thread.yield(), it is more or less unclear if the culprit is in call50 method, or call49 method, or in any of the other ones down the stack. So you have to analyze all methods in turn, until you find the one that contains the infinite loop. Meanwhile, your faulty program has burnt lots of CPU cycles, but did not help you at all tracking down the problem. While in the background, lots of stack frames have been popped and pushed again, due to the fact that there is one spot where most time is spent (in this example it is Thread.yield(), but it can also be some I/O operation), thread dumps will always look the same.

But there is a solution.

StackTraceSealer

The basic idea behind StackTraceSealer: Provide a way to "seal" stack frames on a Thread's stack trace in production, via JMX. Whenever a frame is popped from the stack, it becomes unsealed, so you can easily see where the border between the still sealed stack frames (that have never been exited) and the unsealed stack frames (that have been exited and re-entered) is. That is the point where you have to look at.

But the world is not ideal, and Java does not provide such a feature out of the box. What it does provide, though, is a way to build a rough approximation to this ideal: When "sealing" a stack frame, you can instrument the method it is currently executing. The instrumented method will not affect currently running stack frames, but if the frame is exited and re-entered, the instrumented method is called, which will inform our JMX agent to consider the current frame unsealed. If the frame is exited and a different frame is entered, it will not be caught by this method, but in that case it can be caught when creating the next thread dump, as the stack frame will no longer point at the same location.

Java's instrumentation API is generally well designed, however there is another little flaw: When re-instrumenting already loaded classes, native methods cannot be instrumented. Therefore, if the border between the sealed and the unsealed stack frames is at a native method, it may be off by one (or more in case of more native methods).

Another caveat: The infinite loop will usually be in the first sealed frame, as it has never been exited. If there are line numbers in your stack traces, and there are two possible paths up to the last unsealed frame (as that method is called from two places) it can happen that your second thread dump accidentally includes the second call, the line numbers differ, and the frame becomes unsealed - therefore the infinite loop may also be in the last unsealed frame instead.

Hands-on tutorial – let's see StackTraceSealer in action

When you download StackTraceSealer and extract it, you will find 4 Jar files inside it:

examples: Small example programs to test StackTraceSealer on.
runtime: The JMX MBean and other runtime classes to be injected into the target process
agent: The agent library that does the bytecode transformations
plugin: The JConsole plugin

Except for the examples jar, which is not needed for operation, the other three Jar files have to be in the same directory so that the plugin can find them.

Now, let's start some example:
java -jar StackTraceSealer-examples.jar long

The example will run and not stop, because it contains an infinite loop. So, also start JConsole, but with our plugin:
jconsole -pluginpath StackTraceSealer-plugin.jar

After having attached to our example process by PID, there will be a new StackTraceSealer tab in JConsole. However, when switching to it, it is still not possible to seal our stack traces. The reason is simple: We need to load our agent library first. As loading agent libraries is done via PID and cannot be done via a JMX connection, but the only information JConsole passes to the plugins is the JMX connection, we have to select the PID of our target process again on the StackTraceSealer tab and click "Load" to load our agent (see below how to preload the agent in scenarios you cannot load it via PID). Afterwards you will see a list of threads and the stack trace of the selected thread. If you select the main thread, you should see the long stacktrace already mentioned in the introduction.

Now click the "Seal" button. When the stack trace is updated next, you will see [S] markers next to some of the stack frames (the sealed ones) like this:

	 at java.lang.Thread.yield(Native Method)
	 at net.sf.stsealer.examples.LongExample.call50(LongExample.java:65)
	 at net.sf.stsealer.examples.LongExample.call49(LongExample.java:62)
	 at net.sf.stsealer.examples.LongExample.call48(LongExample.java:61)
	 at net.sf.stsealer.examples.LongExample.call47(LongExample.java:60)
	 at net.sf.stsealer.examples.LongExample.call46(LongExample.java:59)
	 at net.sf.stsealer.examples.LongExample.call45(LongExample.java:58)
	 at net.sf.stsealer.examples.LongExample.call44(LongExample.java:57)
	 at net.sf.stsealer.examples.LongExample.call43(LongExample.java:56)
[S]	 at net.sf.stsealer.examples.LongExample.call42(LongExample.java:55)
[S]	 at net.sf.stsealer.examples.LongExample.call41(LongExample.java:54)
[S]	 at net.sf.stsealer.examples.LongExample.call40(LongExample.java:53)
[S]	 at net.sf.stsealer.examples.LongExample.call39(LongExample.java:52)
[S]	 at net.sf.stsealer.examples.LongExample.call38(LongExample.java:51)
[S]	 at net.sf.stsealer.examples.LongExample.call37(LongExample.java:50)
[S]	 at net.sf.stsealer.examples.LongExample.call36(LongExample.java:49)
[S]	 at net.sf.stsealer.examples.LongExample.call35(LongExample.java:48)
[S]	 at net.sf.stsealer.examples.LongExample.call34(LongExample.java:47)
[S]	 at net.sf.stsealer.examples.LongExample.call33(LongExample.java:46)
...

The rest of the frames are also shown, but omitted for clarity here. The first sealed frame, in this case call42 (no, the number is not a coincidence), contains the infinite loop.

If you want to, you can now "unseal" the stack trace and let the program continue. In practice, you will most likely want to fix the bug now and restart the fixed version afterwards. (In theory, after unsealing all threads, the program should continue running as it did before, without any performance penalties. In practice, better be safe and restart it as soon as possible).

The other available examples are "native", "oneoff", and "nativeoneoff". Have a look at them and try to find the infinite loop (Spoiler: The infinite loop in all these examples is in call07).

Preloading StackTraceSealer agent library

Depending on your usual monitoring, you might be in a situation where you cannot load either the JMX agent or the StackTraceSealer agent at runtime. Probably your monitoring equipment is running on a different machine or with different credentials. In this case, you most likely have configured the JMX agent statically, therefore you should also configure the StackTraceSealer agent statically when you need it. To do so, first copy the runtime and the agent JAR to the target system. Then add a -javaagent option to your java call, for example:
java -Dcom.sun.management.jmxremote.port=1221 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -javaagent:StackTraceSealer-agent.jar=StackTraceSealer-runtime.jar -jar StackTraceSealer-examples.jar oneoff

Note that the javaagent parameter is the (absolute or relative) path to the runtime jar. Then you can connect JConsole to [hostname]:1221 as usual and StackTraceSealer will work without having to load anything first.