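How WatchDog Works
[Based on Android P]
First, a look at MTK's explanation of the watchdog principle:
That is only an overview before we dig in; the detailed code walkthrough follows.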
1. SystemServer.startOtherServices
private void startOtherServices() {
final Context context = mSystemContext;
...
try{
...
traceBeginAndSlog("InitWatchdog");
//【2】Get the singleton instance
final Watchdog watchdog = Watchdog.getInstance();
//【3】Initialize
watchdog.init(context, mActivityManagerService);
traceEnd();
...
traceBeginAndSlog("StartWatchdog");
//【4】Start
Watchdog.getInstance().start();
traceEnd();
...
}catch (RuntimeException e) {
Slog.e("System", "******************************************");
Slog.e("System", "************ Failure starting core service", e);
}
...
}
Watchdog extends Thread and is instantiated as a singleton; SystemServer then calls its init method to initialize it.
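To illustrate the pattern outside Android, here is a minimal plain-Java sketch of a Thread subclass that can only be reached through a lazily created singleton. The class name `DemoWatchdog`, the `owner` field and the `init(String)` signature are hypothetical stand-ins, and the `synchronized` keyword is an addition of this sketch, not something taken from the Watchdog source.

```java
// Hypothetical stand-in for Watchdog: a Thread subclass with a private
// constructor, reachable only through getInstance().
class DemoWatchdog extends Thread {
    private static DemoWatchdog sInstance;
    private String mOwner; // set by init(); stands in for Context / AMS

    private DemoWatchdog() {
        super("demo-watchdog"); // thread name, like super("watchdog")
    }

    // synchronized is an assumption of this sketch so that it is safe to
    // call from any thread.
    static synchronized DemoWatchdog getInstance() {
        if (sInstance == null) {
            sInstance = new DemoWatchdog();
        }
        return sInstance;
    }

    // Separate init step, so the instance can be created before its
    // dependencies are available.
    void init(String owner) {
        mOwner = owner;
    }

    String owner() {
        return mOwner;
    }

    @Override
    public void run() {
        // The monitoring loop would go here.
    }
}
```

Splitting construction from init() mirrors the order seen in startOtherServices: obtain the instance, initialize it, and only later call start().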
2. Watchdog.getInstance
public static Watchdog getInstance() {
if (sWatchdog == null) {
sWatchdog = new Watchdog();
}
return sWatchdog;
}
This creates the Watchdog instance on first use.
2.1. Watchdog.Watchdog
private Watchdog() {
super("watchdog");
// Initialize handler checkers for each common thread we want to check.
// Note that we are not currently checking the background thread,
// since it may hold longer-running operations
// with no guarantee about their timeliness.
// Monitor the android.fg thread (this checker also carries the registered monitors)
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
// Monitor the main thread
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
// Monitor the android.ui thread
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
// Monitor the android.io thread
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
// Monitor the android.display thread
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
// Initialize the monitor for binder threads
addMonitor(new BinderThreadMonitor());
// Create the fd monitor; the process's open file descriptors are listed under /proc/self/fd/
mOpenFdMonitor = OpenFdMonitor.create();
// See the notes on DEFAULT_TIMEOUT.
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}
3. Watchdog.init
public void init(Context context, ActivityManagerService activity) {
mResolver = context.getContentResolver();
mActivity = activity;
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
}
This registers a receiver for the reboot broadcast, i.e. the so-called soft reboot.
3.1 RebootRequestReceiver.onReceive
final class RebootRequestReceiver extends BroadcastReceiver {
@Override
public void onReceive(Context c, Intent intent) {
if (intent.getIntExtra("nowait", 0) != 0) {
rebootSystem("Received ACTION_REBOOT broadcast");
return;
}
Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
}
}
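When the "nowait" extra is non-zero, RebootRequestReceiver's onReceive method calls rebootSystem (which performs the reboot through PowerManagerService) to restart the phone; any other ACTION_REBOOT broadcast is only logged as unsupported.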
4. Watchdog.getInstance().start()
Because Watchdog is itself a Thread, calling start() spawns the watchdog thread, which then executes its run() method.
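The difference between start() and a direct call to run() is easy to check in plain Java. The sketch below is only an illustration; the class `StartVsRunDemo` and its helper method are hypothetical names.

```java
// Shows that start() executes run() on the new thread, while calling
// run() directly executes it on the caller's thread.
class StartVsRunDemo {
    // Returns the name of the thread that actually executed run().
    static String runnerName(boolean useStart) {
        final String[] name = new String[1];
        Thread t = new Thread("watchdog") {
            @Override
            public void run() {
                name[0] = Thread.currentThread().getName();
            }
        };
        if (useStart) {
            t.start(); // a new thread named "watchdog" executes run()
            try {
                t.join(); // wait for it, which also makes name[0] visible here
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        } else {
            t.run(); // plain method call on the caller's thread
        }
        return name[0];
    }
}
```

This is why the monitoring loop never blocks SystemServer's startup path: after start() returns, the loop lives on its own thread.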
Watchdog.run():
static final boolean DB = false;
static final long DEFAULT_TIMEOUT = DB ? 10*1000 : 60*1000;
static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;//30s
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
final List<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
long timeout = CHECK_INTERVAL;//30s
//Every 30s, schedule a check on each HandlerChecker
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
//【5】
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// Make sure the full 30s elapse before the code below runs, even if wait(timeout) returns early (e.g. on interrupt)
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
boolean fdLimitTriggered = false;
if (mOpenFdMonitor != null) {
fdLimitTriggered = mOpenFdMonitor.monitor();
}
//Evaluate the checkers' completion state and act on it
if (!fdLimitTriggered) {
//【6】
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
//All checks completed: reset and start the next round
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
//Still waiting, and less than half of the timeout has elapsed
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
//Blocked for 30s: dump the stacks of system_server and some native processes once
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
ActivityManagerService.dumpStackTraces(true, pids, null, null,
getInterestingNativePids());
waitedHalf = true;
//waitedHalf prevents a second dump if the next pass is still in this state; any later dump is left to the OVERDUE path below.
}
continue;
}
// The state is OVERDUE, i.e. blocked for 60 seconds or more
blockedCheckers = getBlockedCheckersLocked();//【7】
subject = describeCheckersLocked(blockedCheckers);
} else {
blockedCheckers = Collections.emptyList();
subject = "Open FD high water mark reached";
}
allowRestart = mAllowRestart;
}
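//Reaching this point means a monitored thread in system_server has been stuck for more than 60s (or the open-fd limit was hit);
//the stacks are dumped, then system_server is killed so that it restarts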
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
//Dump the stacks of the processes about to be killed【8】
final File stack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, getInterestingNativePids());
// Allow some extra time so that the dump is written out completely
SystemClock.sleep(2000);
// Ask the kernel to dump all blocked threads, and the backtraces of all CPUs, to the kernel log【9】
doSysRq('w');
doSysRq('l');
// Try to add the error to the dropbox
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) {}
IActivityController controller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
// Kill system_server
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
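This method is the core of watchdog monitoring.
It acts according to waitState:
- COMPLETED or WAITING: continue straight to the next loop iteration;
- WAITED_HALF (blocked for more than 30s), first time only: dump the traces of system_server and some native processes;
- OVERDUE: dump more information and, unless a debugger, the activity controller or mAllowRestart prevents it, kill system_server.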
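The branching above can be condensed into a small state machine. This is a simplified sketch, not the Watchdog code: the `Action` enum and the class `WatchdogDecisionDemo` are hypothetical, and the debugger and activity-controller checks are left out.

```java
// Simplified decision logic of one pass of the watchdog loop.
class WatchdogDecisionDemo {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    // Hypothetical labels for what the loop does next.
    enum Action { CONTINUE, DUMP_HALF_AND_CONTINUE, DUMP_AND_KILL }

    // Remembers whether the half-time dump was already taken for this stall.
    private boolean waitedHalf = false;

    Action step(int waitState) {
        switch (waitState) {
            case COMPLETED:
                waitedHalf = false; // healthy again: re-arm the half-time dump
                return Action.CONTINUE;
            case WAITING:
                return Action.CONTINUE;
            case WAITED_HALF:
                if (!waitedHalf) {
                    waitedHalf = true; // dump only once per stall
                    return Action.DUMP_HALF_AND_CONTINUE;
                }
                return Action.CONTINUE;
            default: // OVERDUE
                waitedHalf = false; // reset, as the end of the real loop does
                return Action.DUMP_AND_KILL;
        }
    }
}
```

Feeding it WAITED_HALF twice yields one dump followed by a plain continue, which is the behaviour the waitedHalf flag exists to guarantee.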
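The steps marked in the code are analyzed in detail below:
- [5] hc.scheduleCheckLocked(); // post each checker to its thread so that its monitors run
- [6] evaluateCheckerCompletionLocked(); // evaluate the completion state of the HandlerCheckers
- [7] getBlockedCheckersLocked() // collect the HandlerCheckers that have been stuck for 60s
- [8] ActivityManagerService.dumpStackTraces // dump call stacks
- [9] doSysRq(); // dump to the kernel log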
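Before any of these steps are evaluated, run() waits out the full CHECK_INTERVAL, recomputing the remaining time after every wake-up. The sketch below isolates that idiom in plain Java. `FullIntervalWaitDemo` is a hypothetical name, and System.nanoTime() stands in for SystemClock.uptimeMillis(), which is Android-only.

```java
// Waits for a full interval even if Object.wait() returns early.
class FullIntervalWaitDemo {
    // Blocks for at least intervalMs and returns the elapsed time in ms.
    static long waitFullInterval(Object lock, long intervalMs) {
        final long start = System.nanoTime();
        synchronized (lock) {
            long timeout = intervalMs;
            while (timeout > 0) {
                try {
                    lock.wait(timeout);
                } catch (InterruptedException e) {
                    // Watchdog only logs here; the loop simply waits again.
                }
                // Recompute what is left, so an early wake-up (interrupt,
                // notify or spurious) cannot shorten the interval.
                long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                timeout = intervalMs - elapsedMs;
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Even when the calling thread is already interrupted, the method still takes the full interval, because the first wait() throws immediately and the loop goes around again with the remaining time.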
5. Watchdog.HandlerChecker.scheduleCheckLocked
public final class HandlerChecker implements Runnable {
private final Handler mHandler;
private final String mName;
private final long mWaitMax;
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private boolean mCompleted;
private Monitor mCurrentMonitor;
private long mStartTime;
HandlerChecker(Handler handler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
public void addMonitor(Monitor monitor) {
mMonitors.add(monitor);
}
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
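//With no monitors (true for every checker except android.fg) and the looper idle in poll, the thread is healthy: mark the check completed
mCompleted = true;
return;
}
if (!mCompleted) {
//The previous check has not finished yet, so return without posting another one.
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();//Record the start time of this check
mHandler.postAtFrontOfQueue(this);//Post this checker at the very front of the message queue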
}
......
}
mHandler.postAtFrontOfQueue(this): this method takes a Runnable. Through the Handler message mechanism, the target thread eventually calls back the run method of HandlerChecker.
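Posting at the front matters because the checker then waits only for the message that is currently executing, not for the whole backlog. The following plain-Java sketch imitates this with a deque. `FrontOfQueueDemo` and its methods are hypothetical and only mimic the ordering behaviour; they are not the Android Handler or Looper implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy message queue: post() appends, postAtFrontOfQueue() jumps the line.
class FrontOfQueueDemo {
    private final Deque<Runnable> queue = new ArrayDeque<>();

    void post(Runnable r) {
        queue.addLast(r);
    }

    void postAtFrontOfQueue(Runnable r) {
        queue.addFirst(r);
    }

    // Stand-in for the looper: runs everything queued, in queue order.
    void drain() {
        Runnable r;
        while ((r = queue.pollFirst()) != null) {
            r.run();
        }
    }

    // Queues two ordinary messages, then a checker at the front, and
    // returns the order in which they actually ran.
    static List<String> demo() {
        List<String> order = new ArrayList<>();
        FrontOfQueueDemo q = new FrontOfQueueDemo();
        q.post(() -> order.add("message A"));
        q.post(() -> order.add("message B"));
        q.postAtFrontOfQueue(() -> order.add("checker"));
        q.drain();
        return order;
    }
}
```

Here the checker runs before both pending messages, so a long queue alone does not delay it; only a message that is already running (or a blocked monitor) can.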
5.1. HandlerChecker.run
[-> Watchdog.java]
@Override
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
//Call back monitor() on each service that implements Watchdog.Monitor
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
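The run method iterates over all registered Monitor objects. Each service implements the interface's monitor() method, and once every call has returned, mCompleted is set to true. If the message the handler is currently processing runs for too long, so that run() does not get its turn, or if a monitor() call blocks, mCompleted stays false and the watchdog eventually fires.
Taking AMS as an example of a service whose monitor method is called back through Watchdog.Monitor: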
public class ActivityManagerService extends IActivityManager.Stub
implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
...
public ActivityManagerService(Context systemContext) {
...
Watchdog.getInstance().addMonitor(this);
Watchdog.getInstance().addThread(mHandler);
...
}
// Probe for deadlock: this blocks for as long as another thread holds the AMS lock
public void monitor() {
synchronized (this) { }
}
...
}
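The empty synchronized block in monitor() works as a lock probe: it returns at once when the lock is free and hangs for as long as another thread holds it. The plain-Java sketch below reproduces that effect with latches so that the outcome does not depend on timing luck. All names in it (`LockProbeDemo`, `serviceLock`, the thread names) are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Demonstrates that a lock probe cannot complete while the lock is held.
class LockProbeDemo {
    // Returns {completed while the lock was held, completed after release}.
    static boolean[] demo() {
        final Object serviceLock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        final CountDownLatch probeDone = new CountDownLatch(1);

        // Plays the stuck service: takes the lock and keeps it until released.
        Thread holder = new Thread(() -> {
            synchronized (serviceLock) {
                lockHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    // fall through and release the lock
                }
            }
        }, "stuck-service");

        // Plays HandlerChecker.run(): the probe has the same shape as monitor().
        Thread probe = new Thread(() -> {
            synchronized (serviceLock) { }
            probeDone.countDown(); // corresponds to mCompleted = true
        }, "checker");

        try {
            holder.start();
            lockHeld.await(); // the lock is definitely taken now
            probe.start();
            boolean whileHeld = probeDone.await(200, TimeUnit.MILLISECONDS);
            release.countDown(); // the service lets go of the lock
            boolean afterRelease = probeDone.await(5, TimeUnit.SECONDS);
            return new boolean[] { whileHeld, afterRelease };
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The probe stays incomplete while the lock is held and finishes shortly after release. In Watchdog, the first case is what keeps mCompleted false until the state reaches OVERDUE.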
6. Watchdog.evaluateCheckerCompletionLocked() and HandlerChecker.getCompletionStateLocked()
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
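evaluateCheckerCompletionLocked() returns the largest (worst) state among all checkers in the mHandlerCheckers list.
getCompletionStateLocked() returns one of:
- COMPLETED = 0: the check has finished;
- WAITING = 1: the check has been waiting for less than half of DEFAULT_TIMEOUT, i.e. less than 30s;
- WAITED_HALF = 2: the check has been waiting between 30s and 60s;
- OVERDUE = 3: the check has been waiting for 60s or more.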
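The thresholds are easiest to verify with concrete numbers. The sketch below restates the two methods as pure functions that take the latency as a parameter instead of reading the clock. The class `CompletionStateDemo` and its method names are hypothetical; the comparisons are copied from the code above.

```java
// Pure-function restatement of the completion-state logic.
class CompletionStateDemo {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;

    // Same thresholds as getCompletionStateLocked(), with time passed in.
    static int stateOf(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) {
            return COMPLETED;
        }
        if (latencyMs < waitMaxMs / 2) {
            return WAITING;
        }
        if (latencyMs < waitMaxMs) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    // Same reduction as evaluateCheckerCompletionLocked(): the worst state wins.
    static int worstOf(int... states) {
        int state = COMPLETED;
        for (int s : states) {
            state = Math.max(state, s);
        }
        return state;
    }
}
```

With waitMaxMs = 60000, a latency of 29999 ms is WAITING, exactly 30000 ms is already WAITED_HALF, and exactly 60000 ms is OVERDUE. Because the states are ordered by severity, a single stalled checker is enough to raise the overall result.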
7. Watchdog.getBlockedCheckersLocked()
private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
//Collect every checker that has not completed and is overdue
if (hc.isOverdueLocked()) {
checkers.add(hc);
}
}
return checkers;
}
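The filter can be tried out with a simplified checker type. Note that the body of isOverdueLocked() is not shown in this article; the sketch assumes it means "not completed and past start time plus wait max". The `Checker` class and `blockedNames` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified version of the overdue filter.
class BlockedCheckersDemo {
    // Hypothetical, reduced checker: only the fields the filter needs.
    static final class Checker {
        final String name;
        final boolean completed;
        final long startMs;
        final long waitMaxMs;

        Checker(String name, boolean completed, long startMs, long waitMaxMs) {
            this.name = name;
            this.completed = completed;
            this.startMs = startMs;
            this.waitMaxMs = waitMaxMs;
        }

        // Assumed meaning of isOverdueLocked(); the real body is not quoted above.
        boolean isOverdue(long nowMs) {
            return !completed && nowMs > startMs + waitMaxMs;
        }
    }

    // Returns the names of all checkers that are overdue at nowMs.
    static List<String> blockedNames(List<Checker> all, long nowMs) {
        List<String> names = new ArrayList<>();
        for (Checker c : all) {
            if (c.isOverdue(nowMs)) {
                names.add(c.name);
            }
        }
        return names;
    }
}
```

Only checkers that are both unfinished and past their deadline end up in the list, which is why the subject line of a watchdog report names specific threads rather than all of them.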
8. ActivityManagerService.dumpStackTraces
This article focuses on the watchdog monitoring flow, so the stack dumping done here is not analyzed in depth; the same goes for doSysRq().
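For orientation only: on Linux, a SysRq command can be triggered by writing a single character to /proc/sysrq-trigger, where 'w' dumps blocked tasks and 'l' dumps backtraces of the active CPUs. The sketch below assumes doSysRq() does essentially this; its body is not shown in this article. The class `SysRqDemo` and the path parameter are hypothetical, added so that the code can be tried on an ordinary file without root.

```java
import java.io.FileWriter;
import java.io.IOException;

// Writes a SysRq command character to a trigger file.
class SysRqDemo {
    // On a real device the path would be /proc/sysrq-trigger, which normally
    // requires root. Returns false instead of throwing when the write fails.
    static boolean writeSysRq(String triggerPath, char command) {
        try (FileWriter trigger = new FileWriter(triggerPath)) {
            trigger.write(command); // a single character, e.g. 'w' or 'l'
            return true;
        } catch (IOException e) {
            return false; // no permission or no such file
        }
    }
}
```

Failing quietly is a deliberate choice in this sketch: the dump is a best-effort diagnostic, so an error there should not get in the way of the rest of the shutdown path.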
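The detailed flow of the whole watchdog is shown in the following flowchart:
(I found a flowchart online that is very detailed, certainly more detailed than anything I would draw, so I am borrowing it.)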