本文的描述基于Linux操作系统。
一、jstack命令结果格式说明
为便于叙述,运行“示例代码1”,使用jstack命令打印对应进程的线程栈信息,得到“线程栈信息示例1”。
示例代码1:
public class Main {
public static void main(String[] args) throws InterruptedException {
Object objA = new Object();
Object objB = new Object();
Thread threadA = new Thread(new Runnable() {
public void run() {
synchronized (objA) {
try {
Thread.sleep(10000);
} catch (InterruptedException e) {
e.printStackTrace();
}
synchronized (objB) {
System.out.println("hello");
}
}
}
}, "threadA");
Thread.sleep(5000);
threadA.start();
Thread threadB = new Thread(new Runnable() {
public void run() {
synchronized (objB) {
synchronized (objA) {
System.out.println("world");
}
}
}
}, "threadB");
threadB.start();
}
}
线程栈信息示例1:
2021-01-13 21:02:17
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):
"Attach Listener" #12 daemon prio=9 os_prio=0 tid=0x00007f0584001000 nid=0x5112 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"DestroyJavaVM" #11 prio=5 os_prio=0 tid=0x00007f05bc00a800 nid=0x50e1 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"threadB" #10 prio=5 os_prio=0 tid=0x00007f05bc0f0000 nid=0x50f1 waiting for monitor entry [0x00007f05a589d000]
java.lang.Thread.State: BLOCKED (on object monitor)
at Main$2.run(Main.java:32)
- waiting to lock <0x000000078665b048> (a java.lang.Object)
- locked <0x000000078665b058> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
"threadA" #9 prio=5 os_prio=0 tid=0x00007f05bc0ee000 nid=0x50f0 waiting for monitor entry [0x00007f05a599e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007f05bc0d2000 nid=0x50ed runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C1 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007f05bc0bd000 nid=0x50ec waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f05bc0bb000 nid=0x50eb waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f05bc0b8800 nid=0x50ea waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f05bc0b7000 nid=0x50e9 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f05bc084000 nid=0x50e8 in Object.wait() [0x00007f05a6e83000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000786608ed0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
- locked <0x0000000786608ed0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)
"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f05bc081800 nid=0x50e7 in Object.wait() [0x00007f05a6f84000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000786606bf8> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:502)
at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
- locked <0x0000000786606bf8> (a java.lang.ref.Reference$Lock)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
"VM Thread" os_prio=0 tid=0x00007f05bc078000 nid=0x50e6 runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f05bc01f800 nid=0x50e2 runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f05bc021000 nid=0x50e3 runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f05bc023000 nid=0x50e4 runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f05bc024800 nid=0x50e5 runnable
"VM Periodic Task Thread" os_prio=0 tid=0x00007f05bc0dd000 nid=0x50ee waiting on condition
JNI global references: 5
Found one Java-level deadlock:
=============================
"threadB":
waiting to lock monitor 0x00007f058c002418 (object 0x000000078665b048, a java.lang.Object),
which is held by "threadA"
"threadA":
waiting to lock monitor 0x00007f058c0064b8 (object 0x000000078665b058, a java.lang.Object),
which is held by "threadB"
Java stack information for the threads listed above:
===================================================
"threadB":
at Main$2.run(Main.java:32)
- waiting to lock <0x000000078665b048> (a java.lang.Object)
- locked <0x000000078665b058> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
"threadA":
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
Found 1 deadlock.
1.1、线程栈信息主体
主体就是对一个个Java线程打印时状态的信息描述,在“线程栈信息示例1”中就是从"Attach Listener" #12 daemon prio=9 os_prio=0 tid=0x00007f0584001000 nid=0x5112 waiting on condition [0x0000000000000000]
到"VM Periodic Task Thread" os_prio=0 tid=0x00007f05bc0dd000 nid=0x50ee waiting on condition
的部分。
在一个Java进程中,线程分为“业务线程”和“VM线程”两种。
在“线程栈信息示例1”中,业务线程有:
- threadA
- threadB
在“线程栈信息示例1”中,VM线程有:
- Attach Listener
- DestroyJavaVM
- Service Thread
- C1 CompilerThread2
- C2 CompilerThread1
- C2 CompilerThread0
- Signal Dispatcher
- Finalizer
- Reference Handler
- VM Thread
- GC task thread#0 (ParallelGC)
- GC task thread#1 (ParallelGC)
- GC task thread#2 (ParallelGC)
- GC task thread#3 (ParallelGC)
- VM Periodic Task Thread
1.1.1、线程名
含义:线程名,就是java.lang.Thread类实例的name属性值。
重要性:重要。
示例:比如“threadB”,“threadA”,“Finalizer”。
1.1.2、Thread实例ID
含义:就是java.lang.Thread类实例的tid属性值。在创建java.lang.Thread实例时自动创建,在线程存活期内不可变,在线程死亡后其tid可被其他线程复用。
重要性:不重要。
示例:比如“#12”中的“12”。
1.1.3、是否为后台线程
含义:表征该线程是否为后台线程,对应java.lang.Thread类实例的daemon属性值。默认为非后台线程,使用“daemon”表征后台线程。
重要性:不重要。
示例:比如"Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007f05bc0d2000 nid=0x50ed runnable [0x0000000000000000]
中的“daemon”。
1.1.4、prio
含义:Java线程的优先级,对应java.lang.Thread类实例的priority属性值。
重要性:不重要。
示例:比如“prio=9”。
1.1.5、os_prio
含义:Java线程所映射到Linux原生线程的优先级。
重要性:不重要。
示例:比如“os_prio=0”。
1.1.6、tid
含义:java.lang.Thread类实例背后的C++数据结构实例的内存地址。注意不要跟“1.1.2、Thread实例ID”小节中的tid混淆。
重要性:不重要。
示例:比如“tid=0x00007f05bc0d2000”。
1.1.7、nid
含义:所映射到Linux原生线程的ID。排查高CPU占用线程问题时经常用到,参见《Linux下如何定位Java进程内CPU消耗最多的Java线程》。
重要性:重要。
示例:比如“nid=0x50ed”。
1.1.8、基本描述
含义:线程打印时状态的基本描述。
重要性:不重要。需要注意的是,可能跟详细描述不一致。
示例:比如runnable
,waiting for monitor entry
,waiting on condition
和in Object.wait()
。
1.1.9、最后已知栈指针
含义:最后已知的栈帧指针,具体含义笔者也不是很清楚。
重要性:不重要。
示例:比如“[0x00007f05a599e000]”。
1.1.10、详细描述
含义:线程打印时状态的详细描述。第一行给出了线程的状态,有RUNNABLE
,BLOCKED
,WAITING
和TIMED_WAITING
4种可能(完整的线程状态列表可见《Java并发编程基础》,使用jstack命令打印线程栈信息时并不会包含NEW
和TERMINATED
状态的线程);接下来几行打印了执行路径栈。
重要性:重要。
示例:如下都是示例。
1、RUNNABLE
处于IO_WAIT
情形的线程:
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x000000078661b698> (a java.io.BufferedInputStream)
at Main.main(Main.java:10)
处于RUNNING
或者READY
情形的线程:
java.lang.Thread.State: RUNNABLE
at Main.main(Main.java:9)
2、BLOCKED
已申请获得一个synchronized锁,再尝试申请另外一个synchronized锁:
java.lang.Thread.State: BLOCKED (on object monitor)
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
以上示例中,“<0x000000078665b058>”和“<0x000000078665b048>”是synchronized锁对应的Java对象在某个维度的内存地址,可能是Java对象维度的内存地址,也可能是Java对象背后C++数据结构实例的内存地址。对于其明确含义“不求甚解”,不影响问题排查。接下来简称该内存地址为synchronized锁锁标记
。
3、WAITING
调用Object类的wait()
方法的线程:
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000078665adc8> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at Main.main(Main.java:11)
- locked <0x000000078665adc8> (a java.lang.Object)
调用LockSupport类的park()
方法的线程:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at Main.main(Main.java:7)
调用LockSupport类的park(Object blocker)
方法的线程:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000078665ad88> (a java.lang.Object)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at Main.main(Main.java:7)
备注:
- 在以上最后一个例子中,“<0x000000078665ad88>”是传入blocker这个Java对象在某个维度的内存地址,可能是Java对象维度的内存地址,也可能是Java对象背后C++数据结构实例的内存地址。对于其明确含义“不求甚解”,不影响问题排查。接下来简称该内存地址为
Park锁锁标记
- 在《Java并发编程基础》中我们知道,LockSupport类的方法
park()
和相应的重载方法park(Object blocker)
核心语义一致,他们的差别也仅在于:在jstack命令打印结果中,包含后者的执行路径栈与包含前者的执行路径栈相比,额外多了一句形如- parking to wait for <0x000000078665ad88> (a java.lang.Object)
的描述 - 需要对调用LockSupport类
park(Object blocker)
方法时的形如- parking to wait for <0x000000078665ad88> (a java.lang.Object)
描述与申请synchronized锁时的形如- waiting to lock <0x00000007ab3f31b8> (a java.lang.Object)
描述作区分:后者中的0x00000007ab3f31b8
指代一个synchronized锁对象,线程因申请该synchronized锁对象不得而挂起,因此描述具有具体明确的语法语义;前者中的0x000000078665ad88
指代传入的blocker对象,该blocker对象与线程挂起不一定有关系,因此描述不具有具体明确的语法语义,只是起到打印上下文信息的日志作用
4、TIMED_WAITING
调用Thread类sleep(long millis)
/sleep(long millis, int nanos)
方法的线程:
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at Main.main(Main.java:8)
调用Object类wait(long timeout)
/wait(long timeout, int nanos)
方法的线程:
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000078665ad70> (a java.lang.Object)
at Main.main(Main.java:6)
- locked <0x000000078665ad70> (a java.lang.Object)
调用LockSupport类parkNanos(long nanos)
/parkUntil(long deadline)
方法的线程:
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
at Main.main(Main.java:5)
调用LockSupport类parkNanos(Object blocker, long nanos)
/parkUntil(Object blocker, long deadline)
方法的线程:
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000078665ad28> (a java.lang.Object)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at Main.main(Main.java:6)
备注:
- 对以上最后一个例子中形如
- parking to wait for <0x000000078665ad28> (a java.lang.Object)
描述的说明参见上一小节的备注
1.2、线程栈信息其他
1.2.1、头部
比如“线程栈信息示例1”中的:
2021-01-12 21:50:42
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.211-b12 mixed mode):
1.2.2、尾部
比如“线程栈信息示例1”中的:
JNI global references: 5
Found one Java-level deadlock:
=============================
"threadB":
waiting to lock monitor 0x00007f058c002418 (object 0x000000078665b048, a java.lang.Object),
which is held by "threadA"
"threadA":
waiting to lock monitor 0x00007f058c0064b8 (object 0x000000078665b058, a java.lang.Object),
which is held by "threadB"
Java stack information for the threads listed above:
===================================================
"threadB":
at Main$2.run(Main.java:32)
- waiting to lock <0x000000078665b048> (a java.lang.Object)
- locked <0x000000078665b058> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
"threadA":
at Main$1.run(Main.java:18)
- waiting to lock <0x000000078665b058> (a java.lang.Object)
- locked <0x000000078665b048> (a java.lang.Object)
at java.lang.Thread.run(Thread.java:748)
Found 1 deadlock.
需要注意的是,上述描述“发现死锁”的部分内容(从Found one Java-level deadlock:
开始到Found 1 deadlock.
为止)只在探测到存在该种简单死锁情形时才会出现。
二、排查问题
排查Java线程相关问题围绕“jstack命令打印的线程栈信息文件”进行,其是一切的源头,相较最为权威。分析上述线程栈信息文件,可使用辅助工具(比如下面会谈及的“XSheepdog”,“TDA开源工具”等)但也须对“辅助工具本身可能存在各种问题,甚至造成困扰,不可过度依赖”保持清醒。
2.1、排查Java线程运行卡滞问题
Java线程运行卡滞的含义是:大部分时间处于WAITING
、TIMED_WAITING
、BLOCKED
状态,或者是RUNNABLE
状态的IO_WAIT
情形。
卡滞问题其实并不好排查,最好排查的卡滞情形是:一个Java线程T0申请获取到synchronized锁S0,其他大量Java线程Tx尝试申请S0不得而进入BLOCKED状态,T0长时间非正常持有,导致Tx出现卡滞现象。
排查上述卡滞情形问题的路径是:聚合分析jstack命令的打印结果,定位被最多线程等待获取的synchronized锁,并找出持有该synchronized锁的线程。尝试了几种方案,对方案的评价如表1。
表1
方案 | 评价 |
---|---|
线上方案,比如“XSheepdog”[3] | 存在数据安全风险 |
TDA开源工具[4] | 能够快速方便定位被最多线程等待的synchronized锁(在“Monitors”页面中会标红被怀疑的synchronized锁),紧接着也能快速方便找到持有该synchronized锁的线程。 不过对于该工具,笔者感觉有两个小瑕疵:1)“State”列采用了“基本描述”字段值,而没有采用来自“详细描述”字段中第一行的精确的线程状态值;2)将形如 - parking to wait for <0x000000078665ad88> (a java.lang.Object) 描述中的Park锁锁标记 与synchronized锁锁标记 混淆 |
个人脚本 | 1、对jstack命令打印结果文件“jstack.out”执行grep -E "waiting to lock" jstack.out | awk -F"<" '{print $(2)}' | awk -F">" '{print $(1)}' | sort | uniq -c 命令找到疑似synchronized锁锁标记2、然后再在“jstack.out”文件中搜索 locked <步骤1中获取到的疑似synchronized锁锁标记> 找到持有相应synchronized锁的线程。特别需要注意的是:以locked <步骤1中获取到的疑似synchronized锁锁标记> 方式去搜索持有疑似synchronized锁的线程需要排除调用Object.wait或者Thread.join(本质通过调用Object.wait实现)方法的情形[6] |
参考文献
[1]https://dzone.com/articles/how-to-read-a-thread-dump
[2]https://gist.github.com/rednaxelafx/843622
[3]https://www.perfma.com/docs/thread/thread-guide
[4]https://github.com/irockel/tda
[5]https://dzone.com/articles/how-analyze-java-thread-dumps
[6]《Java并发编程基础》中“6.6、调用Object.wait或者Thread.join方法挂起查看所打印jstack命令结果的一个注意点”小节