WifiStateMachine deadlock exhausts the Binder thread pool and triggers a Watchdog restart

0. System information: the problem occurred on Android P. It was reproduced twice under stress testing, and so far it looks like a bug in stock Google code.

1. The failure reports the trace below: the Watchdog timed out while waiting for a Binder thread to become available.

Watchdog: Reporting stuck state to activity controller
MonitorActivityController: ** ERROR: PROCESS NOT RESPONDING
MonitorActivityController: message: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
MonitorActivityController: #
MonitorActivityController: Allowing system to die.
Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.Watchdog$BinderThreadMonitor on foreground thread (android.fg)
Watchdog: android.fg annotated stack trace:
Watchdog:     at android.os.Binder.blockUntilThreadAvailable(Native Method)
Watchdog:     at com.android.server.Watchdog$BinderThreadMonitor.monitor(Watchdog.java:234)
Watchdog:     at com.android.server.Watchdog$HandlerChecker.run(Watchdog.java:206)
Watchdog:     at android.os.Handler.handleCallback(Handler.java:873)
Watchdog:     at android.os.Handler.dispatchMessage(Handler.java:99)
Watchdog:     at android.os.Looper.loop(Looper.java:193)
Watchdog:     at android.os.HandlerThread.run(HandlerThread.java:65)
Watchdog:     at com.android.server.ServiceThread.run(ServiceThread.java:44)
Watchdog: *** GOODBYE!

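2. Tracing back from there, the log below was printed repeatedly for more than a minute. The Binder thread pool is exhausted, so most likely some Binder calls are stuck and never return.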

18:04:37.921  3672  3816 W IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=32 mMaxThreads=31
..........
18:05:10.400  3672  3816 W IPCThreadState: Waiting for thread to be free. mExecutingThreadsCount=32 mMaxThreads=31

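3. A Binder thread overrun can be analyzed directly from the ANR trace. For this failure, searching for the keyword "Binder:3672" (Binder plus the PID of the failing process) lists all of the Binder threads. In this case the process had created 32 of them.
  [Figure: abnormal Binder threads]
  Going through the individual Binder thread traces, the thread trace below is repeated many times. It shows WifiStateMachine.syncGetSupportedFeatures requesting the supported feature set, but the service side never replies, so AsyncChannel blocks and never returns.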

"Binder:3672_D" prio=5 tid=97 Waiting
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13c00a48 self=0xe3e2d200
  | sysTid=7169 nice=0 cgrp=default sched=0/0 handle=0xcc298970
  | state=S schedstat=( 0 0 0 ) utm=2083 stm=2938 core=1 HZ=100
  | stack=0xcc19d000-0xcc19f000 stackSize=1010KB
  | held mutexes=
  at java.lang.Object.wait(Native method)
  waiting on <0x00fac4f7> (a java.lang.Object)
  at com.android.internal.util.AsyncChannel$SyncMessenger.sendMessageSynchronously(AsyncChannel.java:825)
  locked <0x00fac4f7> (a java.lang.Object)
  at com.android.internal.util.AsyncChannel$SyncMessenger.access$100(AsyncChannel.java:739)
  at com.android.internal.util.AsyncChannel.sendMessageSynchronously(AsyncChannel.java:653)
  at com.android.server.wifi.util.WifiAsyncChannel.sendMessageSynchronously(WifiAsyncChannel.java:92)
  at com.android.internal.util.AsyncChannel.sendMessageSynchronously(AsyncChannel.java:666)
  at com.android.server.wifi.WifiStateMachine.syncGetSupportedFeatures(WifiStateMachine.java:1771)
  at com.android.server.wifi.WifiServiceImpl.getSupportedFeatures(WifiServiceImpl.java:1656)
  at android.net.wifi.IWifiManager$Stub.onTransact(IWifiManager.java:54)
  at android.os.Binder.execTransact(Binder.java:731)
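
Counting how many Binder threads share the same stuck frame can be done mechanically instead of by eye. The following is a minimal sketch, not a tool used in the original analysis: the class name BinderTraceSummary, the method summarize, and the second thread "Binder:3672_E" in the sample input are made up for illustration. It assumes the dump uses the format shown above, where each thread starts with a quoted name and frames start with "at ".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: groups the Binder threads of one process by the
// first stack frame that starts with a given prefix.
class BinderTraceSummary {

    static Map<String, Integer> summarize(String dump, int pid, String framePrefix) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String binderHeader = "\"Binder:" + pid + "_";
        boolean inBinderThread = false;
        boolean matched = false;
        // The trailing quote is a sentinel header so the last thread is flushed.
        for (String raw : (dump + "\n\"").split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A new thread header: close out the previous Binder thread.
                if (inBinderThread && !matched) {
                    counts.merge("(no matching frame)", 1, Integer::sum);
                }
                inBinderThread = line.startsWith(binderHeader);
                matched = false;
            } else if (inBinderThread && !matched && line.startsWith("at " + framePrefix)) {
                // Keep only the method name, dropping "(File.java:123)".
                int paren = line.indexOf('(');
                String method = line.substring(3, paren < 0 ? line.length() : paren);
                counts.merge(method, 1, Integer::sum);
                matched = true;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample input; the second Binder thread is invented.
        String dump = String.join("\n",
            "\"Binder:3672_D\" prio=5 tid=97 Waiting",
            "  at java.lang.Object.wait(Native method)",
            "  at com.android.server.wifi.WifiStateMachine.syncGetSupportedFeatures(WifiStateMachine.java:1771)",
            "\"Binder:3672_E\" prio=5 tid=98 Waiting",
            "  at java.lang.Object.wait(Native method)",
            "  at com.android.server.wifi.WifiStateMachine.syncGetSupportedFeatures(WifiStateMachine.java:1771)",
            "\"WifiStateMachine\" prio=5 tid=49 Blocked",
            "  at com.android.server.wifi.SupplicantStaIfaceHal.setupIface(SupplicantStaIfaceHal.java:279)");
        System.out.println(summarize(dump, 3672, "com.android.server.wifi.WifiStateMachine."));
    }
}
```

A frame that dominates the counts, such as syncGetSupportedFeatures here, is the natural place to start looking for the thread that should have answered.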

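4. syncGetSupportedFeatures and similar calls are answered by the handler inside WifiStateMachine. Since they are stuck here, we need to check the state of the WifiStateMachine thread. The two thread traces below show that it is deadlocked with HwBinder:3672_2.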

"WifiStateMachine" prio=5 tid=49 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13341eb8 self=0xd1385400
  | sysTid=4720 nice=0 cgrp=default sched=0/0 handle=0xd0845970
  | state=S schedstat=( 0 0 0 ) utm=1207 stm=1968 core=3 HZ=100
  | stack=0xd0742000-0xd0744000 stackSize=1042KB
  | held mutexes=
  at com.android.server.wifi.SupplicantStaIfaceHal.checkSupplicantStaIfaceAndLogFailure(SupplicantStaIfaceHal.java:2287)
  waiting to lock <0x0d70fee7> (a java.lang.Object) held by thread 67
  at com.android.server.wifi.SupplicantStaIfaceHal.setupIface(SupplicantStaIfaceHal.java:279)
  at com.android.server.wifi.WifiNative.setupInterfaceForClientMode(WifiNative.java:879)
  locked <0x0af944a6> (a java.lang.Object)
  at com.android.server.wifi.ClientModeManager$ClientModeStateMachine$IdleState.processMessage(ClientModeManager.java:226)
  at com.android.internal.util.StateMachine$SmHandler.processMsg(StateMachine.java:992)
  at com.android.internal.util.StateMachine$SmHandler.handleMessage(StateMachine.java:809)
  at android.os.Handler.dispatchMessage(Handler.java:106)
  at android.os.Looper.loop(Looper.java:193)
  at android.os.HandlerThread.run(HandlerThread.java:65)
 
"HwBinder:3672_2" prio=5 tid=67 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x13141260 self=0xd13a5600
  | sysTid=4751 nice=0 cgrp=default sched=0/0 handle=0xcf5d9970
  | state=S schedstat=( 0 0 0 ) utm=12 stm=8 core=3 HZ=100
  | stack=0xcf4de000-0xcf4e0000 stackSize=1010KB
  | held mutexes=
  at com.android.server.wifi.WifiNative$SupplicantDeathHandlerInternal.onDeath(WifiNative.java:524)
  waiting to lock <0x0af944a6> (a java.lang.Object) held by thread 49
  at com.android.server.wifi.SupplicantStaIfaceHal.supplicantServiceDiedHandler(SupplicantStaIfaceHal.java:537)
  locked <0x0d70fee7> (a java.lang.Object)
  at com.android.server.wifi.SupplicantStaIfaceHal.lambda$new$1(SupplicantStaIfaceHal.java:142)
  locked <0x0d70fee7> (a java.lang.Object)
  at com.android.server.wifi.-$$Lambda$SupplicantStaIfaceHal$MsPuzKcT4xAfuigKAAOs1rYm9CU.serviceDied(lambda:-1)
  at android.os.HwRemoteBinder.sendDeathNotice(HwRemoteBinder.java:62)

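5. According to the traces, the failure occurs when supplicant dies and hwbinder invokes the onDeath callback while, at the same time, WifiManager has sent a request for WifiStateMachine to handle. If T2.3 happens to run between T1.2 and T1.5, T1 and T2 deadlock.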

[T1.1] WifiNative.setupInterfaceForClientMode
[T1.2] == acquire WifiNative.mLock
[T1.3]     --> SupplicantStaIfaceHal.setupIface
[T1.4]         -->SupplicantStaIfaceHal.checkSupplicantStaIfaceAndLogFailure
[T1.5]         == acquire SupplicantStaIfaceHal.mLock
[T2.1] <lambda>()
[T2.2]     --> SupplicantStaIfaceHal.supplicantServiceDiedHandler
[T2.3]     == acquire SupplicantStaIfaceHal.mLock
[T2.4]        --> WifiNative$SupplicantDeathHandlerInternal.onDeath
[T2.5]        == acquire WifiNative.mLock
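
The interleaving above is an ordinary lock-order inversion, and it can be reproduced outside Android. The sketch below is only an illustration under stated assumptions: the two plain Object locks stand in for WifiNative.mLock and SupplicantStaIfaceHal.mLock, the class name LockOrderDeadlockDemo is made up, and a CountDownLatch forces the unlucky timing that the stress test hit by chance. The JVM's ThreadMXBean then confirms that the two threads are deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-alone reproduction of the T1/T2 lock-order inversion.
class LockOrderDeadlockDemo {

    /** Returns true once both threads are reported as monitor-deadlocked. */
    static boolean reproduce() {
        final Object wifiNativeLock = new Object();     // stand-in for WifiNative.mLock
        final Object supplicantHalLock = new Object();  // stand-in for SupplicantStaIfaceHal.mLock
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // T1 takes the WifiNative lock first, T2 takes the HAL lock first.
        Thread t1 = new Thread(
            () -> lockInOrder(wifiNativeLock, supplicantHalLock, bothHoldFirstLock), "T1");
        Thread t2 = new Thread(
            () -> lockInOrder(supplicantHalLock, wifiNativeLock, bothHoldFirstLock), "T2");
        // Daemon threads, so the stuck pair does not keep the JVM alive.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = mx.findMonitorDeadlockedThreads();
                if (ids != null && contains(ids, t1.getId()) && contains(ids, t2.getId())) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread holds its first lock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (second) {
                // Never reached: the other thread holds this lock and waits for ours.
            }
        }
    }

    private static boolean contains(long[] ids, long id) {
        for (long candidate : ids) {
            if (candidate == id) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + reproduce());
    }
}
```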

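6. Because the WifiStateMachine thread is stuck, requests from WifiManager go unanswered. Client apps keep hitting ANR, restarting, and calling the interface again, so a large number of requests pile up blocked, which eventually causes the Binder thread overrun.
  Code review confirms that mDeathEventHandler is referenced in only a few places and does not actually need to be protected by mLock. The fix is therefore to move the onDeath call in supplicantServiceDiedHandler, in the file below, out of the mLock block, which breaks the deadlock condition.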

https://android.googlesource.com/platform/frameworks/opt/net/wifi/+/master/service/java/com/android/server/wifi/SupplicantStaIfaceHal.java
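
The shape of that fix can be sketched in isolation. This is not the actual AOSP patch: Hal, DeathHandler, checkIface, and mIfaceReady are simplified stand-ins invented for the example, and only the names mLock, mDeathEventHandler, supplicantServiceDiedHandler, and onDeath come from the analysis above. The sketch takes a snapshot of the handler while holding mLock and invokes it only after the lock is released, then replays the T1/T2 interleaving to show that both threads finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of the fix: call onDeath outside mLock.
class DeathHandlerOutsideLock {

    interface DeathHandler {
        void onDeath();
    }

    static class Hal {
        private final Object mLock = new Object();
        private DeathHandler mDeathEventHandler;
        private boolean mIfaceReady = true; // invented state, for illustration

        void registerDeathEventHandler(DeathHandler handler) {
            synchronized (mLock) {
                mDeathEventHandler = handler;
            }
        }

        boolean checkIface() {
            synchronized (mLock) {
                return mIfaceReady;
            }
        }

        void supplicantServiceDiedHandler() {
            DeathHandler handler;
            synchronized (mLock) {
                mIfaceReady = false;          // state cleanup stays under the lock
                handler = mDeathEventHandler; // snapshot only
            }
            if (handler != null) {
                handler.onDeath();            // invoked with mLock released
            }
        }
    }

    /** Replays the T1/T2 interleaving; returns true if both threads finish. */
    static boolean runScenario() {
        final Object wifiNativeLock = new Object(); // stand-in for WifiNative.mLock
        final Hal hal = new Hal();
        final CountDownLatch t1HoldsOuterLock = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(2);

        hal.registerDeathEventHandler(() -> {
            synchronized (wifiNativeLock) {
                done.countDown();
            }
        });

        Thread t1 = new Thread(() -> {
            synchronized (wifiNativeLock) {
                t1HoldsOuterLock.countDown();
                try {
                    Thread.sleep(100); // give T2 time to enter the died handler
                } catch (InterruptedException e) {
                    return;
                }
                hal.checkIface();
            }
            done.countDown();
        }, "T1");
        Thread t2 = new Thread(() -> {
            try {
                t1HoldsOuterLock.await();
            } catch (InterruptedException e) {
                return;
            }
            hal.supplicantServiceDiedHandler();
        }, "T2");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        try {
            return done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + runScenario());
    }
}
```

With the callback outside mLock, T2 never holds SupplicantStaIfaceHal.mLock while waiting for WifiNative.mLock, so the circular wait from step 5 cannot form.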

Bug Fix
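  This issue was first traced to the deadlock, while the main thread was idle. After fixing the deadlock, I asked the owner to review the fix, and later also asked him to help confirm whether the Binder overrun was related to it. He said he was busy and declined, so in the end I had to skim through the code and communication mechanism of WifiServiceImpl, WifiStateMachine, and AsyncChannel myself before I could reach a firm conclusion.
  Because I have many other issues pending, I went through it in a hurry, and there are still many details I have not studied in depth. What impressed me most is the AsyncChannel mechanism that acts as the bridge between them: it supports communication between threads and even between processes, and the model seems quite practical. I plan to find time to read through this part of the code completely and then write up a code analysis note. Overall, this area is quite interesting.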
