0%

使用jstack命令排查问题

本文的描述基于Linux操作系统。

一、jstack命令结果格式说明

为便于叙述,运行“示例代码1”,使用jstack命令打印对应进程的线程栈信息,得到“线程栈信息示例1”。

示例代码1:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
public class Main {
public static void main(String[] args) throws InterruptedException {

Object objA = new Object();
Object objB = new Object();

Thread threadA = new Thread(new Runnable() {

public void run() {
synchronized (objA) {
try {
Thread.sleep(10000);
} catch (InterruptedException e) {
e.printStackTrace();
}

synchronized (objB) {
System.out.println("hello");
}
}
}
}, "threadA");

Thread.sleep(5000);
threadA.start();

Thread threadB = new Thread(new Runnable() {

public void run() {
synchronized (objB) {
synchronized (objA) {
System.out.println("world");
}
}
}
}, "threadB");
threadB.start();
}
}

线程栈信息示例1:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
2021-01-13 21:02:17
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):

"Attach Listener" #12 daemon prio=9 os_prio=0 tid=0x00007f0584001000 nid=0x5112 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" #11 prio=5 os_prio=0 tid=0x00007f05bc00a800 nid=0x50e1 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"threadB" #10 prio=5 os_prio=0 tid=0x00007f05bc0f0000 nid=0x50f1 waiting for monitor entry [0x00007f05a589d000]
java.lang.Thread.State: BLOCKED (on object monitor)
at Main$2.run(Main.java:32)
- waiting to lock <0x000000078665b048> (a java.lang.Object)
- locked <0x000000078665b058> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)

"threadA" #9 prio=5 os_prio=0 tid=0x00007f05bc0ee000 nid=0x50f0 waiting for monitor entry [0x00007f05a599e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)

"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007f05bc0d2000 nid=0x50ed runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"C1 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007f05bc0bd000 nid=0x50ec waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f05bc0bb000 nid=0x50eb waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f05bc0b8800 nid=0x50ea waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f05bc0b7000 nid=0x50e9 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f05bc084000 nid=0x50e8 in Object.wait() [0x00007f05a6e83000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000786608ed0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
- locked <0x0000000786608ed0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f05bc081800 nid=0x50e7 in Object.wait() [0x00007f05a6f84000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000786606bf8> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
- locked <0x0000000786606bf8> (a java.lang.ref.Reference$Lock)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"VM Thread" os_prio=0 tid=0x00007f05bc078000 nid=0x50e6 runnable

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f05bc01f800 nid=0x50e2 runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f05bc021000 nid=0x50e3 runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f05bc023000 nid=0x50e4 runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f05bc024800 nid=0x50e5 runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007f05bc0dd000 nid=0x50ee waiting on condition

JNI global references: 5


Found one Java-level deadlock:
=============================
"threadB":
waiting to lock monitor 0x00007f058c002418 (object 0x000000078665b048, a java.lang.Object),
which is held by "threadA"
"threadA":
waiting to lock monitor 0x00007f058c0064b8 (object 0x000000078665b058, a java.lang.Object),
which is held by "threadB"

Java stack information for the threads listed above:
===================================================
"threadB":
at Main$2.run(Main.java:32)
- waiting to lock <0x000000078665b048> (a java.lang.Object)
- locked <0x000000078665b058> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
"threadA":
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.

1.1、线程栈信息主体

主体就是对一个个Java线程打印时状态的信息描述,在“线程栈信息示例1”中就是从"Attach Listener" #12 daemon prio=9 os_prio=0 tid=0x00007f0584001000 nid=0x5112 waiting on condition [0x0000000000000000]"VM Periodic Task Thread" os_prio=0 tid=0x00007f05bc0dd000 nid=0x50ee waiting on condition的部分。

在一个Java进程中,线程分为“业务线程”和“VM线程”两种。
在“线程栈信息示例1”中,业务线程有:

  • threadA
  • threadB

在“线程栈信息示例1”中,VM线程有:

  • Attach Listener
  • DestroyJavaVM
  • Service Thread
  • C1 CompilerThread2
  • C2 CompilerThread1
  • C2 CompilerThread0
  • Signal Dispatcher
  • Finalizer
  • Reference Handler
  • VM Thread
  • GC task thread#0 (ParallelGC)
  • GC task thread#1 (ParallelGC)
  • GC task thread#2 (ParallelGC)
  • GC task thread#3 (ParallelGC)
  • VM Periodic Task Thread

1.1.1、线程名

含义:线程名,就是java.lang.Thread类实例的name属性值。
重要性:重要
示例:比如“threadB”,“threadA”,“Finalizer”。

1.1.2、Thread实例ID

含义:就是java.lang.Thread类实例的tid属性值。在创建java.lang.Thread实例时自动创建,在线程存活期内不可变,在线程死亡后其tid可被其他线程复用。
重要性:不重要。
示例:比如“#12”中的“12”。

1.1.3、是否为后台线程

含义:表征该线程是否为后台线程,对应java.lang.Thread类实例的daemon属性值。默认为非后台线程,使用“daemon”表征后台线程。
重要性:不重要。
示例:比如"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007f05bc0d2000 nid=0x50ed runnable [0x0000000000000000]中的“daemon”。

1.1.4、prio

含义:Java线程的优先级,对应java.lang.Thread类实例的priority属性值。
重要性:不重要。
示例:比如“prio=9”。

1.1.5、os_prio

含义:Java线程所映射到Linux原生线程的优先级。
重要性:不重要。
示例:比如“os_prio=0”。

1.1.6、tid

含义:java.lang.Thread类实例背后的C++数据结构实例的内存地址。注意不要跟“1.1.2、Thread实例ID”小节中的tid混淆。
重要性:不重要。
示例:比如“tid=0x00007f05bc0d2000”。

1.1.7、nid

含义:所映射到Linux原生线程的ID。排查高CPU占用线程问题时经常用到,参见《Linux下如何定位Java进程内CPU消耗最多的Java线程》
重要性:重要
示例:比如“nid=0x50ed”。

1.1.8、基本描述

含义:线程打印时状态的基本描述。
重要性:不重要。需要注意的是,可能跟详细描述不一致
示例:比如runnablewaiting for monitor entrywaiting on conditionin Object.wait()

1.1.9、最后已知栈指针

含义:最后已知的栈帧指针,具体含义笔者也不是很清楚。
重要性:不重要。
示例:比如“[0x00007f05a599e000]”。

1.1.10、详细描述

含义:线程打印时状态的详细描述。第一行给出了线程的状态,有RUNNABLEBLOCKEDWAITINGTIMED_WAITING4种可能(完整的线程状态列表可见《Java并发编程基础》,使用jstack命令打印线程栈信息时并不会包含NEWTERMINATED状态的线程);接下来几行打印了执行路径栈。
重要性:重要
示例:如下都是示例。

1、RUNNABLE
处于IO_WAIT情形的线程:

1
2
3
4
5
6
7
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x000000078661b698> (a java.io.BufferedInputStream)
at Main.main(Main.java:10)

处于RUNNING或者READY情形的线程:

1
2
java.lang.Thread.State: RUNNABLE
at Main.main(Main.java:9)

2、BLOCKED
已申请获得一个synchronized锁,再尝试申请另外一个synchronized锁:

1
2
3
4
5
java.lang.Thread.State: BLOCKED (on object monitor)
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)

以上示例中,“<0x000000078665b058>”和“<0x000000078665b048>”是synchronized锁对应的Java对象在某个维度的内存地址,可能是Java对象维度的内存地址,也可能是Java对象背后C++数据结构实例的内存地址。对于其明确含义“不求甚解”,不影响问题排查。接下来简称该内存地址为synchronized锁锁标记

3、WAITING
调用Object类的wait()方法的线程:

1
2
3
4
5
6
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000078665adc8> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at Main.main(Main.java:11)
- locked <0x000000078665adc8> (a java.lang.Object)

调用LockSupport类的park()方法的线程:

1
2
3
4
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at Main.main(Main.java:7)

调用LockSupport类的park(Object blocker)方法的线程:

1
2
3
4
5
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000078665ad88> (a java.lang.Object)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at Main.main(Main.java:7)

备注

  • 在以上最后一个例子中,“<0x000000078665ad88>”是传入blocker这个Java对象在某个维度的内存地址,可能是Java对象维度的内存地址,也可能是Java对象背后C++数据结构实例的内存地址。对于其明确含义“不求甚解”,不影响问题排查。接下来简称该内存地址为Park锁锁标记
  • 《Java并发编程基础》中我们知道,LockSupport类的方法park()和相应的重载方法park(Object blocker)核心语义一致,他们的差别也仅在于:在jstack命令打印结果中,包含后者的执行路径栈与包含前者的执行路径栈相比,额外多了一句形如- parking to wait for <0x000000078665ad88> (a java.lang.Object)的描述
  • 需要对调用LockSupport类park(Object blocker)方法时的形如- parking to wait for <0x000000078665ad88> (a java.lang.Object)描述与申请synchronized锁时的形如- waiting to lock <0x00000007ab3f31b8> (a java.lang.Object)描述作区分:后者中的0x00000007ab3f31b8指代一个synchronized锁对象,线程因申请该synchronized锁对象不得而挂起,因此描述具有具体明确的语法语义;前者中的0x000000078665ad88指代传入的blocker对象,该blocker对象与线程挂起不一定有关系,因此描述不具有具体明确的语法语义,只是起到打印上下文信息的日志作用

4、TIMED_WAITING
调用Thread类sleep(long millis)/sleep(long millis, int nanos)方法的线程:

1
2
3
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at Main.main(Main.java:8)

调用Object类wait(long timeout)/wait(long timeout, int nanos)方法的线程:

1
2
3
4
5
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000078665ad70> (a java.lang.Object)
at Main.main(Main.java:6)
- locked <0x000000078665ad70> (a java.lang.Object)

调用LockSupport类parkNanos(long nanos)/parkUntil(long deadline)方法的线程:

1
2
3
4
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
at Main.main(Main.java:5)

调用LockSupport类parkNanos(Object blocker, long nanos)/parkUntil(Object blocker, long deadline)方法的线程:

1
2
3
4
5
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000078665ad28> (a java.lang.Object)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at Main.main(Main.java:6)

备注

  • 对以上最后一个例子中形如- parking to wait for <0x000000078665ad28> (a java.lang.Object)描述的说明参见上一小节的备注

1.2、线程栈信息其他

1.2.1、头部

比如“线程栈信息示例1”中的:

1
2
2021-01-12 21:50:42
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):

1.2.2、尾部

比如“线程栈信息示例1”中的:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
JNI global references: 5


Found one Java-level deadlock:
=============================
"threadB":
waiting to lock monitor 0x00007f058c002418 (object 0x000000078665b048, a java.lang.Object),
which is held by "threadA"
"threadA":
waiting to lock monitor 0x00007f058c0064b8 (object 0x000000078665b058, a java.lang.Object),
which is held by "threadB"

Java stack information for the threads listed above:
===================================================
"threadB":
at Main$2.run(Main.java:32)
- waiting to lock <0x000000078665b048> (a java.lang.Object)
- locked <0x000000078665b058> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
"threadA":
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)

Found 1 deadlock.

需要注意的是,上述描述“发现死锁”的部分内容(从Found one Java-level deadlock:开始到Found 1 deadlock.为止)只在探测到存在该种简单死锁情形时才会出现。

二、排查问题

排查Java线程相关问题围绕“jstack命令打印的线程栈信息文件”进行,其是一切的源头,相较最为权威。分析上述线程栈信息文件,可使用辅助工具(比如下面会谈及的“XSheepdog”,“TDA开源工具”等)但也须对“辅助工具本身可能存在各种问题,甚至造成困扰,不可过度依赖”保持清醒。

2.1、排查Java线程运行卡滞问题

Java线程运行卡滞的含义是:大部分时间处于WAITINGTIMED_WAITINGBLOCKED状态,或者是RUNNABLE状态的IO_WAIT情形。

卡滞问题其实并不好排查,最好排查的卡滞情形是:一个Java线程T0申请获取到synchronized锁S0,其他大量Java线程Tx尝试申请S0不得而进入BLOCKED状态,T0长时间非正常持有,导致Tx出现卡滞现象。
排查上述卡滞情形问题的路径是:聚合分析jstack命令的打印结果,定位被最多线程等待获取的synchronized锁,并找出持有该synchronized锁的线程。尝试了几种方案,对方案的评价如表1。

表1

方案 评价
线上方案,比如“XSheepdog”[3] 存在数据安全风险
TDA开源工具[4] 能够快速方便定位被最多线程等待的synchronized锁(在“Monitors”页面中会标红被怀疑的synchronized锁),紧接着也能快速方便找到持有该synchronized锁的线程。
不过对于该工具,笔者感觉有两个小瑕疵:1)“State”列采用了“基本描述”字段值,而没有采用来自“详细描述”字段中第一行的精确的线程状态值;2)将形如- parking to wait for <0x000000078665ad88> (a java.lang.Object)描述中的Park锁锁标记synchronized锁锁标记混淆
个人脚本 1、对jstack命令打印结果文件“jstack.out”执行grep -E "waiting to lock" jstack.out | awk -F"<" '{print $(2)}' | awk -F">" '{print $(1)}' | sort | uniq -c命令找到疑似synchronized锁锁标记
2、然后再在“jstack.out”文件中搜索locked <步骤1中获取到的疑似synchronized锁锁标记>找到持有相应synchronized锁的线程。特别需要注意的是:以locked <步骤1中获取到的疑似synchronized锁锁标记>方式去搜索持有疑似synchronized锁的线程需要排除调用Object.wait或者Thread.join(本质通过调用Object.wait实现)方法的情形[6]

参考文献

[1]https://dzone.com/articles/how-to-read-a-thread-dump
[2]https://gist.github.com/rednaxelafx/843622
[3]https://www.perfma.com/docs/thread/thread-guide
[4]https://github.com/irockel/tda
[5]https://dzone.com/articles/how-analyze-java-thread-dumps
[6]《Java并发编程基础》中“6.6、调用Object.wait或者Thread.join方法挂起查看所打印jstack命令结果的一个注意点”小节

您的支持将鼓励我继续分享!