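0. System info: the problem occurred on Android P. It was reproduced twice under stress testing, and for now it looks like a bug in Google's stock code.
1. The failure reports the trace below: the Watchdog timed out while waiting for a binder thread to become available.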
Watchdog: Reporting stuck state to activity controller
MonitorActivityController: ** ERROR: PROCESS NOT RESPONDING
MonitorActivityController: message: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
MonitorActivityController: #
MonitorActivityController: Allowing system to die.
Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
Watchdog: android.fg annotated stack trace:
Watchdog: at android.os.Binder.blockUntilThreadAvailable(Native Method)
Watchdog: at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:234)
Watchdog: at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:206)
Watchdog: at android.os.Handler.handleCallback(Handler.java:873)
Watchdog: at android.os.Handler.dispatchMessage(Handler.java:99)
Watchdog: at android.os.Looper.loop(Looper.java:193)
Watchdog: at android.os.HandlerThread.run(HandlerThread.java:65)
Watchdog: at com.android.server.ServiceThread.run(ServiceThread.java:44)
Watchdog: *** GOODBYE!
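2. Tracing back through the log, the warning below was printed repeatedly for more than a minute. The binder thread pool is exhausted, which most likely means some Binder calls are stuck and never return.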
18:04:37.921 3672 3816 W IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=32 mMaxThreads=31
..........
18:05:10.400 3672 3816 W IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=32 mMaxThreads=31
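3. The binder thread exhaustion can be analyzed directly from the ANR trace. In this case, searching for the keyword "Binder:3672" ("Binder:" plus the PID of the failing process) shows all of its binder threads. Here the process had created 32 of them; the relevant excerpt follows.
Going through the individual binder thread traces, the following stack is repeated many times. It shows WifiStateMachine.syncGetSupportedFeatures querying the supported features, but the service side never replies, so the AsyncChannel call blocks and never returns.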
"Binder:3672_D" prio=5 tid=97 Waiting
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13c00a48 self=0xe3e2d200
| sysTid=7169 nice=0 cgrp=default sched=0/0 handle=0xcc298970
| state=S schedstat=( 0 0 0 ) utm=2083 stm=2938 core=1 HZ=100
| stack=0xcc19d000-0xcc19f000 stackSize=1010KB
| held mutexes=
at java.lang.Object.wait(Native method)
waiting on <0x00fac4f7> (a java.lang.Object)
at com.android.internal.util.AsyncChannel$SyncMessenger.sendMessageSynchronously(AsyncChannel.java:825)
locked <0x00fac4f7> (a java.lang.Object)
at com.android.internal.util.AsyncChannel$SyncMessenger.access$100(AsyncChannel.java:739)
at com.android.internal.util.AsyncChannel.sendMessageSynchronously(AsyncChannel.java:653)
at com.android.server.wifi.util.WifiAsyncChannel.sendMessageSynchronously(WifiAsyncChannel.java:92)
at com.android.internal.util.AsyncChannel.sendMessageSynchronously(AsyncChannel.java:666)
at com.android.server.wifi.WifiStateMachine.syncGetSupportedFeatures(WifiStateMachine.java:1771)
at com.android.server.wifi.WifiServiceImpl.getSupportedFeatures(WifiServiceImpl.java:1656)
at android.net.wifi.IWifiManager$Stub.onTransact(IWifiManager.java:54)
at android.os.Binder.execTransact(Binder.java:731)
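The grouping described in step 3 can be done by hand, but a small script makes the repetition easier to see. The following is a minimal sketch, assuming the ANR trace is available as a single string; the class name BinderThreadSummary, the method countByFirstFrame and the framePrefix parameter are hypothetical names for illustration and are not part of Android.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts binder threads of one process by the first
// stack frame that starts with a given prefix.
final class BinderThreadSummary {
    static Map<String, Integer> countByFirstFrame(String trace, int pid, String framePrefix) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String header = "\"Binder:" + pid + "_";   // e.g. "Binder:3672_D"
        boolean inBinderThread = false;
        boolean counted = false;
        for (String raw : trace.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A quoted name starts a new thread section in the trace.
                inBinderThread = line.startsWith(header);
                counted = false;
                continue;
            }
            // Count each binder thread once, at its first matching frame.
            // Threads without a matching frame are ignored.
            if (inBinderThread && !counted && line.startsWith("at " + framePrefix)) {
                counts.merge(line.substring(3), 1, Integer::sum);
                counted = true;
            }
        }
        return counts;
    }
}
```

Calling countByFirstFrame(trace, 3672, "com.android.server.wifi.WifiStateMachine") on a trace like the one above would group the binder threads that are waiting inside WifiStateMachine, which is one way to spot a stack that repeats across many threads.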
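4. syncGetSupportedFeatures and similar calls are answered by the handler inside WifiStateMachine. Since they are stuck here, the state of the WifiStateMachine thread needs to be checked. The two thread traces below show that it is deadlocked with HwBinder:3672_2.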
"WifiStateMachine" prio=5 tid=49 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13341eb8 self=0xd1385400
| sysTid=4720 nice=0 cgrp=default sched=0/0 handle=0xd0845970
| state=S schedstat=( 0 0 0 ) utm=1207 stm=1968 core=3 HZ=100
| stack=0xd0742000-0xd0744000 stackSize=1042KB
| held mutexes=
at com.android.server.wifi.SupplicantStaIfaceHal.checkSupplicantStaIfaceAndLogFailure(SupplicantStaIfaceHal.java:2287)
waiting to lock <0x0d70fee7> (a java.lang.Object) held by thread 67
at com.android.server.wifi.SupplicantStaIfaceHal.setupIface(SupplicantStaIfaceHal.java:279)
at com.android.server.wifi.WifiNative.setupInterfaceForClientMode(WifiNative.java:879)
locked <0x0af944a6> (a java.lang.Object)
at com.android.server.wifi.ClientModeManager$ClientModeStateMachine$IdleState.processMessage(ClientModeManager.java:226)
at com.android.internal.util.StateMachine$SmHandler.processMsg(StateMachine.java:992)
at com.android.internal.util.StateMachine$SmHandler.handleMessage(StateMachine.java:809)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:193)
at android.os.HandlerThread.run(HandlerThread.java:65)
"HwBinder:3672_2" prio=5 tid=67 Blocked
| group="main" sCount=1 dsCount=0 flags=1 obj=0x13141260 self=0xd13a5600
| sysTid=4751 nice=0 cgrp=default sched=0/0 handle=0xcf5d9970
| state=S schedstat=( 0 0 0 ) utm=12 stm=8 core=3 HZ=100
| stack=0xcf4de000-0xcf4e0000 stackSize=1010KB
| held mutexes=
at com.android.server.wifi.WifiNative$SupplicantDeathHandlerInternal.onDeath(WifiNative.java:524)
waiting to lock <0x0af944a6> (a java.lang.Object) held by thread 49
at com.android.server.wifi.SupplicantStaIfaceHal.supplicantServiceDiedHandler(SupplicantStaIfaceHal.java:537)
locked <0x0d70fee7> (a java.lang.Object)
at com.android.server.wifi.SupplicantStaIfaceHal.lambda$new$1(SupplicantStaIfaceHal.java:142)
locked <0x0d70fee7> (a java.lang.Object)
at com.android.server.wifi.-$$Lambda$SupplicantStaIfaceHal$MsPuzKcT4xAfuigKAAOs1rYm9CU.serviceDied(lambda:-1)
at android.os.HwRemoteBinder.sendDeathNotice(HwRemoteBinder.java:62)
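The two traces name each other through their "waiting to lock ... held by thread N" lines: tid 49 waits for tid 67, and tid 67 waits for tid 49. As a rough sketch of how that check could be automated, the code below builds a wait-for map from the trace text and tests whether a thread sits on a cycle. LockCycleFinder and its methods are hypothetical names, and the regular expressions only assume the line shapes visible in the traces above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds monitor wait cycles in ANR trace text.
final class LockCycleFinder {
    // Thread header, e.g.: "WifiStateMachine" prio=5 tid=49 Blocked
    private static final Pattern HEADER =
            Pattern.compile("^\"([^\"]+)\".*\\btid=(\\d+)\\b");
    // Blocked frame, e.g.: waiting to lock <0x0d70fee7> (...) held by thread 67
    private static final Pattern WAITING =
            Pattern.compile("waiting to lock <[^>]+>.*held by thread (\\d+)");

    /** Maps each blocked thread's tid to the tid that holds the monitor it wants. */
    static Map<Integer, Integer> waitsFor(String trace) {
        Map<Integer, Integer> edges = new HashMap<>();
        int current = -1;
        for (String raw : trace.split("\n")) {
            String line = raw.trim();
            Matcher h = HEADER.matcher(line);
            if (h.find()) {
                current = Integer.parseInt(h.group(2));
                continue;
            }
            Matcher w = WAITING.matcher(line);
            if (current >= 0 && w.find()) {
                edges.putIfAbsent(current, Integer.parseInt(w.group(1)));
            }
        }
        return edges;
    }

    /** True if following the wait-for edges from tid leads back to tid. */
    static boolean inCycle(Map<Integer, Integer> edges, int tid) {
        Integer next = edges.get(tid);
        // A cycle cannot be longer than the number of edges, so bound the walk.
        for (int steps = 0; next != null && steps < edges.size(); steps++) {
            if (next == tid) {
                return true;
            }
            next = edges.get(next);
        }
        return false;
    }
}
```

For the traces above, waitsFor would yield the edges 49 to 67 and 67 to 49, so inCycle reports both threads as deadlocked.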
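5. According to the traces, the failure occurs when supplicant dies and hwbinder invokes the onDeath callback while WifiManager has issued a request that WifiStateMachine is processing. If T2.3 happens to execute between T1.2 and T1.5, T1 and T2 deadlock.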
[T1.1] WifiNative.setupInterfaceForClientMode
[T1.2] == acquire WifiNative.mLock
[T1.3] --> SupplicantStaIfaceHal.setupIface
[T1.4] --> SupplicantStaIfaceHal.checkSupplicantStaIfaceAndLogFailure
[T1.5] == acquire SupplicantStaIfaceHal.mLock
[T2.1] <lambda>()
[T2.2] --> SupplicantStaIfaceHal.supplicantServiceDiedHandler
[T2.3] == acquire SupplicantStaIfaceHal.mLock
[T2.4] --> WifiNative$SupplicantDeathHandlerInternal.onDeath
[T2.5] == acquire WifiNative.mLock
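The interleaving above is a classic lock-order inversion: T1 takes WifiNative.mLock and then wants SupplicantStaIfaceHal.mLock, while T2 takes them in the opposite order. The sketch below reproduces only that pattern in plain Java and is not the Android code. It uses ReentrantLock.tryLock with a timeout instead of synchronized so that the demo terminates rather than hanging, and a CyclicBarrier forces the "T2.3 between T1.2 and T1.5" timing. All class, method and variable names are made up for illustration.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: two threads acquire the same two locks in opposite order.
final class LockOrderInversionDemo {
    /** Returns how many of the two threads failed to get their second lock. */
    static int run(long timeoutMillis) {
        ReentrantLock wifiNativeLock = new ReentrantLock(); // stands in for WifiNative.mLock
        ReentrantLock halLock = new ReentrantLock();        // stands in for SupplicantStaIfaceHal.mLock
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger stuck = new AtomicInteger();

        Thread t1 = new Thread(
                () -> acquireInOrder(wifiNativeLock, halLock, barrier, stuck, timeoutMillis), "T1");
        Thread t2 = new Thread(
                () -> acquireInOrder(halLock, wifiNativeLock, barrier, stuck, timeoutMillis), "T2");
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stuck.get();
    }

    private static void acquireInOrder(ReentrantLock first, ReentrantLock second,
            CyclicBarrier barrier, AtomicInteger stuck, long timeoutMillis) {
        first.lock();
        try {
            barrier.await(); // both threads now hold their first lock
            if (second.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                second.unlock();
            } else {
                stuck.incrementAndGet(); // with synchronized this would block forever
            }
            barrier.await(); // keep the first lock until both attempts have finished
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }
}
```

With the barriers in place, neither thread releases its first lock before the other has given up, so both tryLock calls time out. With synchronized blocks, as in the traces, there is no timeout and both threads stay Blocked.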
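6. Because the WifiStateMachine thread is stuck, requests from WifiManager get no response. Client apps keep hitting ANR, restarting and calling the interface again, so blocked requests pile up until the binder thread limit is exceeded.
A code review confirmed that mDeathEventHandler is referenced in only a few places and does not actually need to be protected by mLock. The fix is therefore to move the call to onDeath in supplicantServiceDiedHandler, in the file below, out of the mLock block, which breaks the deadlock condition.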
https://android.googlesource.com/platform/frameworks/opt/net/wifi/+/master/service/java/com/android/server/wifi/SupplicantStaIfaceHal.java
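The shape of that fix can be sketched as follows. This is not the actual AOSP patch: the class DeathHandlerOutsideLockSketch, the DeathEventHandler interface shown here and the helper isLockHeldByCurrentThread are simplified stand-ins. Only the names mLock, mDeathEventHandler, supplicantServiceDiedHandler and onDeath come from the analysis above. The idea is to finish the work that needs mLock, take a snapshot of the handler reference, and invoke the callback only after the lock has been released.

```java
// Simplified stand-in for the HAL class; not the real SupplicantStaIfaceHal.
final class DeathHandlerOutsideLockSketch {
    /** Hypothetical callback type for this sketch. */
    interface DeathEventHandler {
        void onDeath();
    }

    private final Object mLock = new Object();
    private DeathEventHandler mDeathEventHandler;

    void registerDeathHandler(DeathEventHandler handler) {
        synchronized (mLock) {
            mDeathEventHandler = handler;
        }
    }

    void supplicantServiceDiedHandler() {
        DeathEventHandler handler;
        synchronized (mLock) {
            // Internal state cleanup that really needs mLock would stay here.
            handler = mDeathEventHandler; // snapshot the reference under the lock
        }
        // The callback runs without mLock held, so whatever lock onDeath takes
        // can no longer form a cycle with mLock.
        if (handler != null) {
            handler.onDeath();
        }
    }

    /** Test helper: reports whether the calling thread currently holds mLock. */
    boolean isLockHeldByCurrentThread() {
        return Thread.holdsLock(mLock);
    }
}
```

Snapshotting the reference keeps the read of mDeathEventHandler consistent even though the call itself happens outside the lock. Whether the real code needs the snapshot depends on how the field is written elsewhere, which this sketch does not cover.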
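When I first analyzed this issue I found the deadlock, but the main thread was idle. After fixing the deadlock I asked the owner to review the fix, and later also asked him to confirm whether the binder thread exhaustion was related to it. He said he was busy and would not look into it, so in the end I had to skim through the code and communication mechanism of WifiServiceImpl, WifiStateMachine and AsyncChannel myself before I could reach a firm conclusion.
Because I had many other issues waiting, I read the code in a hurry this time, and there are still many details I have not dug into. What impressed me most was the AsyncChannel mechanism that acts as the bridge here: it supports communication between threads and even between processes, and the model seems quite practical. I plan to find time to go through this code in full and then write a code analysis note. Overall, this area is quite interesting.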