Watchdog Workflow

How Watchdog Works

[Based on Android P]

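First, here is MTK's explanation of the watchdog principle:
[Figure: MTK overview of the watchdog principle]
This is only an overview before we dive in; the code is explained in detail below.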

1. SystemServer.startOtherServices

private void startOtherServices() {
    final Context context = mSystemContext;
    ...
    try {
        ...
        traceBeginAndSlog("InitWatchdog");
        // [2] Get the singleton instance
        final Watchdog watchdog = Watchdog.getInstance();
        // [3] Initialize it
        watchdog.init(context, mActivityManagerService);
        traceEnd();
        ...
        traceBeginAndSlog("StartWatchdog");
        // [4] Start the watchdog thread
        Watchdog.getInstance().start();
        traceEnd();
        ...
    } catch (RuntimeException e) {
        Slog.e("System", "******************************************");
        Slog.e("System", "************ Failure starting core service", e);
    }
    ...
}

Watchdog extends the Thread class, is instantiated as a singleton, and is initialized by calling its own init method.

2. Watchdog.getInstance

public static Watchdog getInstance() {
    if (sWatchdog == null) {
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}

This creates the Watchdog instance on the first call and returns the same instance afterwards.
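The pattern used by getInstance is a lazily created singleton. A minimal plain-Java sketch follows; DemoWatchdog is a hypothetical stand-in, not the real class. Like the original it is not synchronized, so it is only safe if nothing races on the first call.

```java
// Hypothetical stand-in for Watchdog: a Thread subclass exposed as a lazy singleton.
final class DemoWatchdog extends Thread {
    private static DemoWatchdog sInstance;

    private DemoWatchdog() {
        super("demo-watchdog"); // thread name, like super("watchdog") in the original
    }

    // Not synchronized, mirroring the original; callers must not race on the first call.
    static DemoWatchdog getInstance() {
        if (sInstance == null) {
            sInstance = new DemoWatchdog();
        }
        return sInstance;
    }
}
```

Every call after the first returns the same object, which is why SystemServer can call Watchdog.getInstance() again later and start the instance it initialized earlier.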

2.1. Watchdog.Watchdog

private Watchdog() {
    super("watchdog");
    // Initialize a handler checker for each common thread we want to check.
    // Note that the background thread is not currently checked,
    // because it may hold longer-running operations
    // with no guarantee about their timeliness.

    // Add a checker for the android.fg thread
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add a checker for the main thread
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
            "main thread", DEFAULT_TIMEOUT));
    // Add a checker for the android.ui thread
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
            "ui thread", DEFAULT_TIMEOUT));
    // Add a checker for the android.io thread
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
            "i/o thread", DEFAULT_TIMEOUT));
    // Add a checker for the android.display thread
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
            "display thread", DEFAULT_TIMEOUT));

    // Initialize the monitor for Binder threads
    addMonitor(new BinderThreadMonitor());
    // Create the open-FD monitor; the open descriptors are listed under /proc/self/fd/
    mOpenFdMonitor = OpenFdMonitor.create();

    // See the notes on DEFAULT_TIMEOUT.
    assert DB ||
            DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}

3. Watchdog.init

public void init(Context context, ActivityManagerService activity) {
    mResolver = context.getContentResolver();
    mActivity = activity;

    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}

This registers a Receiver for the reboot broadcast, the so-called soft reboot.

3.1. RebootRequestReceiver.onReceive

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intent intent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

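When the broadcast carries a non-zero "nowait" extra, RebootRequestReceiver's onReceive method calls rebootSystem (which performs the reboot through PowerManagerService) to restart the phone; otherwise the broadcast is only logged as unsupported.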

4. Watchdog.getInstance().start()

Because Watchdog is itself a Thread, calling its start method launches a new thread that executes its run method.
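As a reminder of the Thread contract this relies on, here is a small plain-Java sketch. StartRunDemo and its method names are hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo: start() returns immediately and run() executes on the new thread.
final class StartRunDemo extends Thread {
    final AtomicReference<String> ranOn = new AtomicReference<>();

    StartRunDemo() {
        super("demo-watchdog");
    }

    @Override
    public void run() {
        // Record which thread actually executed run().
        ranOn.set(Thread.currentThread().getName());
    }

    // Starts the thread, waits for it to finish, and returns the name of the thread that ran run().
    static String runOnce() {
        StartRunDemo t = new StartRunDemo();
        t.start(); // schedules run() on a new thread; does not call it inline
        try {
            t.join(); // wait for run() to finish so the result is visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        return t.ranOn.get();
    }
}
```

The real Watchdog.run() never returns, so nothing joins it; the join here only makes the demo deterministic.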

Watchdog.run():

static final boolean DB = false;
static final long DEFAULT_TIMEOUT = DB ? 10*1000 : 60*1000;
static final long CHECK_INTERVAL = DEFAULT_TIMEOUT / 2;//30s

@Override
public void run() {
    boolean waitedHalf = false;
    while (true) {
        final List<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            long timeout = CHECK_INTERVAL;//30s
            // Every 30s, schedule a check on each HandlerChecker
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                // [5]
                hc.scheduleCheckLocked();
            }
            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }
            // Make sure the code below runs only after the full 30s has elapsed
            // (wait(timeout) may return early, for example when interrupted)
            long start = SystemClock.uptimeMillis();
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }

            boolean fdLimitTriggered = false;
            if (mOpenFdMonitor != null) {
                fdLimitTriggered = mOpenFdMonitor.monitor();
            }
            // Evaluate the checkers' completion state and act accordingly
            if (!fdLimitTriggered) {
                // [6]
                final int waitState = evaluateCheckerCompletionLocked();
                if (waitState == COMPLETED) {
                    // All checkers completed; reset and start the next round
                    waitedHalf = false;
                    continue;
                } else if (waitState == WAITING) {
                    // Still waiting, but less than half of the timeout has elapsed
                    continue;
                } else if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        // Blocked for 30s: dump the stacks of system_server and some native processes once
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            getInterestingNativePids());
                        waitedHalf = true;
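                        // waitedHalf ensures that if the state is still WAITED_HALF on the next pass,
                        // the stacks are not dumped again; the next dump is left to the OVERDUE path below.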
                    }
                    continue;
                }

                // The state is OVERDUE, i.e. blocked for more than 60s
                blockedCheckers = getBlockedCheckersLocked(); // [7]
                subject = describeCheckersLocked(blockedCheckers);
            } else {
                blockedCheckers = Collections.emptyList();
                subject = "Open FD high water mark reached";
            }
            allowRestart = mAllowRestart;
        }
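        // Reaching this point means a monitored thread in system_server has been stuck for
        // more than 60s (or the open-FD limit was hit). The stacks are dumped and
        // system_server is killed, after which it is restarted.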
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);

        ArrayList<Integer> pids = new ArrayList<>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        // [8] Dump the stacks of the processes about to be killed
        final File stack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, getInterestingNativePids());

        // Allow some extra time so the dump can be written out completely
        SystemClock.sleep(2000);

        // [9] Ask the kernel to dump all blocked threads and the backtraces on all CPUs to the kernel log
        doSysRq('w');
        doSysRq('l');

        // Try to add the error to the dropbox
        Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
        dropboxThread.start();
        try {
            dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
        } catch (InterruptedException ignored) {}

        IActivityController controller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            Slog.i(TAG, "Reporting stuck state to activity controller");
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                // 1 = keep waiting, -1 = kill system
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    Slog.i(TAG, "Activity controller requested to coninue to wait");
                    waitedHalf = false;
                    continue;
                }
            } catch (RemoteException e) {
            }
        }

        // Only kill the process if the debugger is not attached.
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
            Slog.w(TAG, "*** GOODBYE!");
            // Kill system_server
            Process.killProcess(Process.myPid());
            System.exit(10);
        }

        waitedHalf = false;
    }
}

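This method is the core of the watchdog's monitoring.

It acts on the value of waitState:

  • COMPLETED or WAITING: continue straight to the next loop iteration;
  • WAITED_HALF (blocked for more than 30s), first time only: dump the traces of system_server and some native processes;
  • OVERDUE: dump more information and, unless a debugger is attached or restart is disallowed, kill system_server.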
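The branching listed above can be restated as a small pure function. This is only a sketch: WaitStateDemo and the action labels are hypothetical, while the constant values follow the ones given for getCompletionStateLocked().

```java
// Hypothetical model of the branch taken by Watchdog.run() after each check interval.
final class WaitStateDemo {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // Returns a label for the action; the labels are illustrative, not from the source.
    static String actionFor(int waitState, boolean waitedHalf) {
        switch (waitState) {
            case COMPLETED:
            case WAITING:
                return "continue";
            case WAITED_HALF:
                // Only the first WAITED_HALF pass dumps stacks.
                return waitedHalf ? "continue" : "dump-then-continue";
            default:
                // OVERDUE: full dump, then kill unless a debugger or mAllowRestart prevents it.
                return "dump-and-maybe-kill";
        }
    }
}
```

For example, actionFor(WaitStateDemo.WAITED_HALF, false) yields "dump-then-continue", and the same state with waitedHalf already set yields "continue".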

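The numbered steps in this method are analyzed below:

  • [5] hc.scheduleCheckLocked(); // schedule each checker to run its monitors
  • [6] evaluateCheckerCompletionLocked(); // evaluate the completion state of the HandlerCheckers
  • [7] getBlockedCheckersLocked() // collect the HandlerCheckers that have been stuck for 60s
  • [8] ActivityManagerService.dumpStackTraces // dump the call stacks
  • [9] doSysRq(); // dump blocked tasks and CPU backtraces to the kernel log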
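Before any of these steps are evaluated, the inner while (timeout > 0) loop in run() guarantees that a full CHECK_INTERVAL has passed even if wait() returns early. A plain-Java sketch of that idea follows; FullIntervalWait is a hypothetical helper, and System.nanoTime() is used here in place of Android's SystemClock.uptimeMillis().

```java
// Hypothetical helper modelled on the inner wait loop of Watchdog.run().
final class FullIntervalWait {
    // Blocks for at least intervalMs, even if wait() returns early; returns the elapsed milliseconds.
    static long waitFullInterval(Object lock, long intervalMs) {
        long start = System.nanoTime();
        synchronized (lock) {
            long remaining = intervalMs;
            while (remaining > 0) {
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    // Like the original, keep waiting; only the remaining time is recomputed.
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
                remaining = intervalMs - elapsedMs;
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Recomputing the remaining time from a fixed start point, rather than restarting the full wait, keeps the total delay close to the interval no matter how many early wake-ups occur.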

5. Watchdog.HandlerChecker.scheduleCheckLocked

public final class HandlerChecker implements Runnable {
    private final Handler mHandler;
    private final String mName;
    private final long mWaitMax;
    private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
    private boolean mCompleted;
    private Monitor mCurrentMonitor;
    private long mStartTime;

    HandlerChecker(Handler handler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }

    public void addMonitor(Monitor monitor) {
        mMonitors.add(monitor);
    }
	
    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            // No monitors (true for every checker except android.fg) and the looper is
            // polling, i.e. idle: mark the check as completed
            mCompleted = true;
            return;
        }

        if (!mCompleted) {
            // The previous check has not finished yet; return without posting again
            return;
        }

        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis(); // record the start time of this check
        mHandler.postAtFrontOfQueue(this); // post this checker at the head of the message queue
    }

	......
}

mHandler.postAtFrontOfQueue(this): this method takes a Runnable; through the Handler message mechanism, the target thread eventually calls back the run method of HandlerChecker.
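The effect of posting at the front can be illustrated with a toy queue. This is a hypothetical single-threaded model built on ArrayDeque, not the Android Handler or MessageQueue API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-threaded model of a message queue; not the Android Handler API.
final class FrontOfQueueDemo {
    private final Deque<Runnable> queue = new ArrayDeque<>();

    // Normal post: the message goes to the tail.
    void post(Runnable r) {
        queue.addLast(r);
    }

    // What the checker uses: the message goes to the head.
    void postAtFrontOfQueue(Runnable r) {
        queue.addFirst(r);
    }

    // Drains the queue in order, the way a looper would dispatch messages.
    void drain() {
        Runnable r;
        while ((r = queue.pollFirst()) != null) {
            r.run();
        }
    }

    // Queues two work items, then the checker at the front, and returns the dispatch order.
    static List<String> demo() {
        List<String> order = new ArrayList<>();
        FrontOfQueueDemo q = new FrontOfQueueDemo();
        q.post(() -> order.add("work-1"));
        q.post(() -> order.add("work-2"));
        q.postAtFrontOfQueue(() -> order.add("checker"));
        q.drain();
        return order;
    }
}
```

The checker jumps ahead of queued messages, but it still has to wait for whatever message the thread is processing at that moment, which is exactly the delay the watchdog measures.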

5.1. HandlerChecker.run

[-> Watchdog.java]

@Override
public void run() {
    final int size = mMonitors.size();
    for (int i = 0 ; i < size ; i++) {
        synchronized (Watchdog.this) {
            mCurrentMonitor = mMonitors.get(i);
        }
        // Call monitor() on each service that implements Watchdog.Monitor
        mCurrentMonitor.monitor();
    }

    synchronized (Watchdog.this) {
        mCompleted = true;
        mCurrentMonitor = null;
    }
}

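The run method iterates over all registered Monitor objects; each service provides its own implementation of the interface's monitor() method. Once all of them have returned, mCompleted is set to true. If the message currently being processed on the handler's thread takes too long, so that run() gets no chance to execute, or if a monitor() call itself blocks, the checker never completes and the watchdog is triggered.

Taking AMS as an example of a service that implements Watchdog.Monitor and has its monitor method called back: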

public class ActivityManagerService extends IActivityManager.Stub
        implements Watchdog.Monitor, BatteryStatsImpl.BatteryCallback {
	...
    public ActivityManagerService(Context systemContext) {
		...
		Watchdog.getInstance().addMonitor(this);
        Watchdog.getInstance().addThread(mHandler);
		...
	}
    // Try to acquire the AMS lock; if this call blocks, AMS is deadlocked or stuck holding the lock
    public void monitor() {
        synchronized (this) { }
    }
	...
}
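Why an empty synchronized block is enough to detect a stuck lock can be seen in a small plain-Java sketch. LockedService and its helper methods are hypothetical; the real check is driven by HandlerChecker on the android.fg thread rather than by a helper thread with a latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical service whose monitor() only acquires and releases the service lock.
final class LockedService {
    void monitor() {
        synchronized (this) { }
    }

    // Runs monitor() on a helper thread and reports whether it returned within timeoutMs.
    static boolean monitorReturns(LockedService s, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            s.monitor();
            done.countDown();
        }, "demo-checker");
        t.setDaemon(true);
        t.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns { result with the lock free, result while the lock is held by the caller }.
    static boolean[] demo() {
        LockedService s = new LockedService();
        boolean free = monitorReturns(s, 1000); // nobody holds the lock
        boolean held;
        synchronized (s) {                      // simulate a thread stuck while holding the lock
            held = monitorReturns(s, 200);
        }
        return new boolean[] { free, held };
    }
}
```

With the lock free, monitor() returns almost immediately; while another thread holds it, monitor() cannot return, and in the real Watchdog that shows up as a HandlerChecker whose mCompleted stays false.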

6. Watchdog.evaluateCheckerCompletionLocked

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}

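evaluateCheckerCompletionLocked() returns the largest (worst) wait state among all checkers in mHandlerCheckers.

getCompletionStateLocked():

  • COMPLETED = 0: the check has completed;
  • WAITING = 1: the wait is shorter than half of DEFAULT_TIMEOUT, i.e. under 30s;
  • WAITED_HALF = 2: the wait is between 30s and 60s;
  • OVERDUE = 3: the wait is 60s or longer.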
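The thresholds above can be checked with a pure-function restatement of the two methods. CompletionStateDemo, stateFor and worstOf are hypothetical names; the logic follows the listing, with the latency passed in instead of read from the clock.

```java
// Hypothetical restatement of getCompletionStateLocked() as a pure function.
final class CompletionStateDemo {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    // State of one checker, given whether it completed and how long it has been waiting.
    static int stateFor(boolean completed, long latencyMs, long waitMaxMs) {
        if (completed) return COMPLETED;
        if (latencyMs < waitMaxMs / 2) return WAITING;
        if (latencyMs < waitMaxMs) return WAITED_HALF;
        return OVERDUE;
    }

    // The overall state is the worst (largest) state across all checkers.
    static int worstOf(int... states) {
        int worst = COMPLETED;
        for (int s : states) {
            worst = Math.max(worst, s);
        }
        return worst;
    }
}
```

With a 60000 ms timeout, a latency of 29999 ms is WAITING, 30000 ms is WAITED_HALF and 60000 ms is OVERDUE; a single slow checker is enough to raise the overall state.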

7. Watchdog.getBlockedCheckersLocked()

private ArrayList<HandlerChecker> getBlockedCheckersLocked() {
    ArrayList<HandlerChecker> checkers = new ArrayList<HandlerChecker>();
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerChecker hc = mHandlerCheckers.get(i);
        // Collect every checker that has not completed and is overdue
        if (hc.isOverdueLocked()) {
            checkers.add(hc);
        }
    }
    return checkers;
}

8. ActivityManagerService.dumpStackTraces

This article focuses on the watchdog's monitoring flow. This step dumps the relevant stacks and is not analyzed in depth here; the same goes for doSysRq().

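A detailed flow chart of the whole watchdog process is shown below:

(I found this flow chart online; it is very detailed, certainly more so than anything I would draw, so I am borrowing it.)
[Figure: detailed watchdog flow chart]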
