Soft and Hard Lockups in the Linux Kernel [repost]

Reposted from: https://blog.csdn.net/xiaojunling/article/details/89248467

This week I ran into a kernel bug related to softlockup and hardlockup. The kernel documentation
turned out to describe their definition and implementation in great detail, and I also found more
material about them online; see the blogs listed in the References at the end of this post.

1. How softlockup and hardlockup are defined in the kernel

A softlockup is a bug that causes the kernel to loop in kernel mode for more than 20 seconds (this
threshold can be changed via kernel parameters) without giving other tasks a chance to run. The current
stack trace is printed when it is detected and, by default, the system stays locked up. Alternatively, the
kernel can be configured to panic: the kernel exposes this through the sysctl kernel.softlockup_panic, the
softlockup_panic boot parameter, and the compile-time option BOOTPARAM_SOFTLOCKUP_PANIC.

A hardlockup is a bug that causes a CPU to loop in kernel mode for more than 10 seconds without letting
other interrupts run. As with softlockups, the current stack trace is printed when it is detected and, unless
the default behavior is changed, the system stays locked up. This behavior can be changed through the sysctl
hardlockup_panic and the compile-time option BOOTPARAM_HARDLOCKUP_PANIC; the nmi_watchdog kernel parameter is also related.

The panic option can be combined with panic_timeout (settable via the kernel.panic sysctl) so that the system reboots automatically after the specified amount of time.

For the detailed parameters, see "Documentation/admin-guide/kernel-parameters.rst" in the kernel source tree.
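
Most of these knobs are also visible at run time under /proc/sys/kernel/. The small user-space sketch
below is my own illustration rather than part of the quoted documentation; it only assumes the standard
procfs file names and simply prints whatever entries exist on the running kernel:

/*
 * show_lockup_sysctls.c - print the lockup-detector sysctls by reading
 * their /proc/sys/kernel/ files. Entries that are missing on a given
 * kernel configuration are reported as unavailable.
 */
#include <stdio.h>

static void show(const char *name)
{
        char path[128], buf[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/sys/kernel/%s", name);
        f = fopen(path, "r");
        if (!f) {
                printf("%-20s <not available>\n", name);
                return;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("%-20s %s", name, buf);
        fclose(f);
}

int main(void)
{
        show("watchdog");
        show("watchdog_thresh");
        show("softlockup_panic");
        show("hardlockup_panic");
        show("nmi_watchdog");
        show("watchdog_cpumask");
        show("panic");
        return 0;
}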

To sum up:

  • a softlockup means that the CPU can no longer schedule other processes to run
  • a hardlockup means that not only can the CPU not schedule processes, interrupts can no longer run either (except NMIs, which are non-maskable interrupts)

The lockups discussed here refer to code running in kernel mode; user-space code can always be preempted.
In addition, the kernel code must be running with preemption disabled: since the Linux kernel is preemptible,
preemption is only disabled in certain code sections, and only in those sections can a lockup occur.
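
To make this concrete, here is a minimal test-module sketch (a hypothetical example; the module and
parameter names are made up, and something like this should only ever be loaded in a throwaway VM):
spinning with preemption disabled should provoke a soft lockup, and additionally disabling local
interrupts should provoke a hard lockup.

/*
 * lockup_test.c - hypothetical demo module. Loading it hogs the current
 * CPU forever:
 *   insmod lockup_test.ko         -> soft lockup (preemption disabled)
 *   insmod lockup_test.ko hard=1  -> hard lockup (interrupts disabled too)
 * Only try this in a disposable virtual machine.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/smp.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/processor.h>

static bool hard;
module_param(hard, bool, 0444);
MODULE_PARM_DESC(hard, "also disable local interrupts to provoke a hard lockup");

static int __init lockup_test_init(void)
{
        preempt_disable();              /* the watchdog thread can no longer run here */
        pr_info("lockup_test: spinning on CPU %d\n", smp_processor_id());

        if (hard)
                local_irq_disable();    /* the hrtimer interrupt can no longer fire either */

        for (;;)                        /* never returns: insmod hangs here */
                cpu_relax();

        return 0;
}
module_init(lockup_test_init);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("deliberately trigger a soft or hard lockup (test only)");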

2. How the softlockup and hardlockup detectors work

Both the soft and hard lockup detectors are implemented on top of the hrtimer and perf subsystems.

How the hardlockup detector works:

The hardlockup detector relies on a periodic hrtimer that fires interrupts and kicks the watchdog task.
An NMI perf event is generated every watchdog_thresh seconds (initialized to 10 at compile time and
configurable via sysctl) to check for hardlockups. If any CPU in the system has not received any hrtimer
interrupts during that period, the hardlockup detector (the handler of the NMI perf event) generates a
kernel warning or calls panic, depending on the configuration.

How the softlockup detector works:

The watchdog task is a high-priority kernel thread that updates a timestamp every time it is scheduled. If that
timestamp is not updated for 2 * watchdog_thresh seconds (the softlockup threshold), the softlockup detector
(coded inside the hrtimer callback) dumps useful debug information to the system log, after which it either
calls panic or lets the other kernel code keep running, depending on the system settings.

The hrtimer period is 2 * watchdog_thresh / 5, which gives it two or three chances to generate an interrupt before the hardlockup detector kicks in.

By default, a watchdog thread named [watchdog/%d] runs on every active CPU. On a kernel configured with
NO_HZ_FULL, however, the watchdog by default runs only on the housekeeping cores, not on the cores listed in
the nohz_full boot parameter. If we allowed the watchdog to run on the nohz_full cores by default, we would
have to run the timer tick to activate the scheduler, which would prevent the nohz_full feature from
protecting user code on those cores from the kernel.

Of course, disabling it by default on the nohz_full cores means that when those CPUs do enter the kernel, we will by
default be unable to detect whether they lock up. However, keeping the watchdog running on the housekeeping
(non-tickless) cores means that lockups on those cores are still detected correctly.

In either case, the set of cores excluded from running the watchdog can be adjusted via the kernel.watchdog_cpumask
sysctl. For nohz_full cores, this can be useful for debugging a situation in which the kernel appears to hang on a nohz_full core.

softlockup_panic=
        [KNL] Should the soft-lockup detector generate panics.
        Format: <integer>

nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
        Format: [panic,][nopanic,][num]
        Valid num: 0 or 1
        0 - turn hardlockup detector in nmi_watchdog off
        1 - turn hardlockup detector in nmi_watchdog on
        When panic is specified, panic when an NMI watchdog
        timeout occurs (or 'nopanic' to override the opposite
        default). To disable both hard and soft lockup detectors,
        please see 'nowatchdog'.
        This is useful when you use a panic=... timeout and
        need the box quickly up again.

watchdog_cpumask:

This value can be used to control on which cpus the watchdog may run.
The default cpumask is all possible cores, but if NO_HZ_FULL is
enabled in the kernel config, and cores are specified with the
nohz_full= boot argument, those cores are excluded by default.
Offline cores can be included in this mask, and if the core is later
brought online, the watchdog will be started based on the mask value.

Typically this value would only be touched in the nohz_full case
to re-enable cores that by default were not running the watchdog,
if a kernel lockup was suspected on those cores.

The argument value is the standard cpulist format for cpumasks,
so for example to enable the watchdog on cores 0, 2, 3, and 4 you
might say:

  echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask

 

3. Code implementation and analysis

Below we use the Linux 4.14 kernel to analyze how these two kinds of lockup detection are implemented:

As mentioned above, both detectors are built on the hrtimer and perf subsystems.

Here we can see that a watchdog has to be registered with the kernel. This is not a hardware watchdog: it is a
monitor emulated with NMIs that simply goes by the name watchdog.

From the code below we can see that when the system boots and this module is loaded, a kernel thread named
watchdog/%u is registered for every CPU core, and this thread periodically calls the watchdog() function.

static struct smp_hotplug_thread watchdog_threads = {
        .store                  = &softlockup_watchdog,
        .thread_should_run      = watchdog_should_run,
        .thread_fn              = watchdog,
        .thread_comm            = "watchdog/%u",
        .setup                  = watchdog_enable,
        .cleanup                = watchdog_cleanup,
        .park                   = watchdog_disable,
        .unpark                 = watchdog_enable,
};

/*
 * Create the watchdog thread infrastructure and configure the detector(s).
 *
 * The threads are not unparked as watchdog_allowed_mask is empty.  When
 * the threads are sucessfully initialized, take the proper locks and
 * unpark the threads in the watchdog_cpumask if the watchdog is enabled.
 */
static __init void lockup_detector_setup(void)
{
        int ret;

        /*
         * If sysctl is off and watchdog got disabled on the command line,
         * nothing to do here.
         */
        lockup_detector_update_enable();

        if (!IS_ENABLED(CONFIG_SYSCTL) &&
            !(watchdog_enabled && watchdog_thresh))
                return;

        ret = smpboot_register_percpu_thread_cpumask(&watchdog_threads,
                                                     &watchdog_allowed_mask);
        if (ret) {
                pr_err("Failed to initialize soft lockup detector threads\n");
                return;
        }

        mutex_lock(&watchdog_mutex);
        softlockup_threads_initialized = true;
        lockup_detector_reconfigure();
        mutex_unlock(&watchdog_mutex);
}

/* Return 0, if a NMI watchdog is available. Error code otherwise */
int __weak __init watchdog_nmi_probe(void)
{
        return hardlockup_detector_perf_init();
}

void __init lockup_detector_init(void)
{
#ifdef CONFIG_NO_HZ_FULL
        if (tick_nohz_full_enabled()) {
                pr_info("Disabling watchdog on nohz_full cores by default\n");
                cpumask_copy(&watchdog_cpumask, housekeeping_mask);
        } else
                cpumask_copy(&watchdog_cpumask, cpu_possible_mask);
#else
        cpumask_copy(&watchdog_cpumask, cpu_possible_mask);
#endif

        if (!watchdog_nmi_probe())
                nmi_watchdog_available = true;
        lockup_detector_setup();
}

The watchdog() function simply updates a timestamp for later use. In watchdog_enable() an hrtimer is set up,
with watchdog_timer_fn installed as the timer's handler.

static void watchdog_enable(unsigned int cpu)
{
        struct hrtimer *hrtimer = this_cpu_ptr(&watchdog_hrtimer);

        /*
         * Start the timer first to prevent the NMI watchdog triggering
         * before the timer has a chance to fire.
         */
        hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        hrtimer->function = watchdog_timer_fn;
        hrtimer_start(hrtimer, ns_to_ktime(sample_period),
                      HRTIMER_MODE_REL_PINNED);

        /* Initialize timestamp */
        __touch_watchdog();
        /* Enable the perf event */
        if (watchdog_enabled & NMI_WATCHDOG_ENABLED)
                watchdog_nmi_enable(cpu);

        watchdog_set_prio(SCHED_FIFO, MAX_RT_PRIO - 1);
}

/* Commands for resetting the watchdog */
static void __touch_watchdog(void)
{
        __this_cpu_write(watchdog_touch_ts, get_timestamp());
}

static int watchdog_should_run(unsigned int cpu)
{
        return __this_cpu_read(hrtimer_interrupts) !=
                __this_cpu_read(soft_lockup_hrtimer_cnt);
}

/*
 * The watchdog thread function - touches the timestamp.
 *
 * It only runs once every sample_period seconds (4 seconds by
 * default) to reset the softlockup timestamp. If this gets delayed
 * for more than 2*watchdog_thresh seconds then the debug-printout
 * triggers in watchdog_timer_fn().
 */
static void watchdog(unsigned int cpu)
{
        __this_cpu_write(soft_lockup_hrtimer_cnt,
                         __this_cpu_read(hrtimer_interrupts));
        __touch_watchdog();
}

 

watchdog_timer_fn() does a few things:

  • it updates the hrtimer_interrupts counter
  • it wakes up the watchdog thread; the thread first calls watchdog_should_run() to check whether it
    should run, and it only runs when hrtimer_interrupts has been updated

And how long is this function's period?

From the initialization of sample_period we can see that it is 10 * 2 / 5 = 4 s; in other words, both the
kernel watchdog thread and the hrtimer callback run with a period of 4 seconds.

int __read_mostly watchdog_thresh = 10;

/*
 * Hard-lockup warnings should be triggered after just a few seconds. Soft-
 * lockups can have false positives under extreme conditions. So we generally
 * want a higher threshold for soft lockups than for hard lockups. So we couple
 * the thresholds with a factor: we make the soft threshold twice the amount of
 * time the hard threshold is.
 */
static int get_softlockup_thresh(void)
{
        return watchdog_thresh * 2;
}

static void set_sample_period(void)
{
        /*
         * convert watchdog_thresh from seconds to ns
         * the divide by 5 is to give hrtimer several chances (two
         * or three with the current relation between the soft
         * and hard thresholds) to increment before the
         * hardlockup detector generates a warning
         */
        sample_period = get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5);
        watchdog_update_hrtimer_threshold(sample_period);
}

static void watchdog_interrupt_count(void)
{
        __this_cpu_inc(hrtimer_interrupts);
}

/* watchdog kicker functions */
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
        unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
        struct pt_regs *regs = get_irq_regs();
        int duration;
        int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;

        if (!watchdog_enabled)
                return HRTIMER_NORESTART;

        /* kick the hardlockup detector */
        watchdog_interrupt_count();

        /* kick the softlockup detector */
        wake_up_process(__this_cpu_read(softlockup_watchdog));

        /* .. and repeat */
        hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));

        if (touch_ts == 0) {
                if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
                        /*
                         * If the time stamp was touched atomically
                         * make sure the scheduler tick is up to date.
                         */
                        __this_cpu_write(softlockup_touch_sync, false);
                        sched_clock_tick();
                }

                /* Clear the guest paused flag on watchdog reset */
                kvm_check_and_clear_guest_paused();
                __touch_watchdog();
                return HRTIMER_RESTART;
        }
        /* check for a softlockup
         * This is done by making sure a high priority task is
         * being scheduled.  The task touches the watchdog to
         * indicate it is getting cpu time.  If it hasn't then
         * this is a good indication some task is hogging the cpu
         */
        duration = is_softlockup(touch_ts);
        if (unlikely(duration)) {
                /*
                 * If a virtual machine is stopped by the host it can look to
                 * the watchdog like a soft lockup, check to see if the host
                 * stopped the vm before we issue the warning
                 */
                if (kvm_check_and_clear_guest_paused())
                        return HRTIMER_RESTART;

                /* only warn once */
                if (__this_cpu_read(soft_watchdog_warn) == true) {
                        /*
                         * When multiple processes are causing softlockups the
                         * softlockup detector only warns on the first one
                         * because the code relies on a full quiet cycle to
                         * re-arm.  The second process prevents the quiet cycle
                         * and never gets reported.  Use task pointers to detect
                         * this.
                         */
                        if (__this_cpu_read(softlockup_task_ptr_saved) !=
                            current) {
                                __this_cpu_write(soft_watchdog_warn, false);
                                __touch_watchdog();
                        }
                        return HRTIMER_RESTART;
                }

                if (softlockup_all_cpu_backtrace) {
                        /* Prevent multiple soft-lockup reports if one cpu is already
                         * engaged in dumping cpu back traces
                         */
                        if (test_and_set_bit(0, &soft_lockup_nmi_warn)) {
                                /* Someone else will report us. Let's give up */
                                __this_cpu_write(soft_watchdog_warn, true);
                                return HRTIMER_RESTART;
                        }
                }

                pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
                        smp_processor_id(), duration,
                        current->comm, task_pid_nr(current));
                __this_cpu_write(softlockup_task_ptr_saved, current);
                print_modules();
                print_irqtrace_events(current);
                if (regs)
                        show_regs(regs);
                else
                        dump_stack();

                if (softlockup_all_cpu_backtrace) {
                        /* Avoid generating two back traces for current
                         * given that one is already made above
                         */
                        trigger_allbutself_cpu_backtrace();

                        clear_bit(0, &soft_lockup_nmi_warn);
                        /* Barrier to sync with other cpus */
                        smp_mb__after_atomic();
                }

                add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
                if (softlockup_panic)
                        panic("softlockup: hung tasks");
                __this_cpu_write(soft_watchdog_warn, true);
        } else
                __this_cpu_write(soft_watchdog_warn, false);

        return HRTIMER_RESTART;
}

 

 

This function uses is_softlockup() to detect whether a softlockup has occurred: it checks whether the
watchdog_touch_ts variable has been updated by the watchdog thread within the last 20 seconds. If it has not,
the thread was never scheduled, which means [watchdog/x] got no chance to run: the CPU is being hogged and the
scheduler cannot schedule anything, i.e. a soft lockup has occurred. In this situation the system may not die,
but it will respond very slowly; you can experiment by using top/ps to find the process hogging the CPU and
lowering its priority.

static int is_softlockup(unsigned long touch_ts)
{
        unsigned long now = get_timestamp();

        if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
                /* Warn about unreasonable delays. */
                if (time_after(now, touch_ts + get_softlockup_thresh()))
                        return now - touch_ts;
        }
        return 0;
}
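
As an aside, kernel code that legitimately has to stay in kernel mode for a long time avoids such false
positives either by yielding with cond_resched() or by touching the watchdogs: touch_softlockup_watchdog()
zeroes watchdog_touch_ts, which is exactly the touch_ts == 0 case handled in watchdog_timer_fn() above. A
hypothetical sketch of that common pattern (the function name here is made up):

/*
 * Hypothetical example of the usual pattern for long-running kernel-mode
 * work: periodically kick the watchdogs and offer to reschedule so that
 * neither the soft nor the hard lockup detector reports a false lockup.
 */
#include <linux/nmi.h>          /* touch_nmi_watchdog() */
#include <linux/sched.h>        /* cond_resched() */

static void long_kernel_mode_work(void)
{
        unsigned long i;

        for (i = 0; i < (1UL << 24); i++) {
                /* ... one unit of real work would go here ... */

                if ((i & 0x3ff) == 0) {
                        touch_nmi_watchdog();   /* also touches the softlockup watchdog */
                        cond_resched();         /* let other tasks run (sleepable context only) */
                }
        }
}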

 

Now look at watchdog_enable() -> watchdog_nmi_enable(), which is where the hard lockup detector is turned on. In
watchdog_nmi_enable() -> hardlockup_detector_perf_enable() -> hardlockup_detector_event_create(), a hardware perf
event is registered via perf_event_create_kernel_counter(), with watchdog_overflow_callback() as its overflow handler.

That event fires roughly every watchdog_thresh seconds: its sample period is cpu_khz * 1000 * watchdog_thresh CPU cycles, i.e. about 10 seconds' worth of cycles on a busy CPU, while the hrtimer callback runs every 4 seconds.

is_hardlockup() mainly checks whether the hrtimer_interrupts variable has been updated by the hrtimer callback.
If it has not, something is wrong with interrupt delivery: most likely some buggy code has kept interrupts disabled for too long.

u64 hw_nmi_get_sample_period(int watchdog_thresh)
{
        return (u64)(cpu_khz) * 1000 * watchdog_thresh;

}

/* watchdog detector functions */
bool is_hardlockup(void)
{
        unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

        if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
                return true;

        __this_cpu_write(hrtimer_interrupts_saved, hrint);
        return false;
}

/* Callback function for perf event subsystem */
static void watchdog_overflow_callback(struct perf_event *event,
                                       struct perf_sample_data *data,
                                       struct pt_regs *regs)
{
        /* check for a hardlockup
         * This is done by making sure our timer interrupt
         * is incrementing.  The timer interrupt should have
         * fired multiple times before we overflow'd.  If it hasn't
         * then this is a good indication the cpu is stuck
         */
        if (is_hardlockup()) {
                /*
                 * Warn once ("Watchdog detected hard LOCKUP on cpu N") and,
                 * if hardlockup_panic is set, call panic(); the rest of the
                 * body is elided here.
                 */
        }
}

static int hardlockup_detector_event_create(void)
{
        unsigned int cpu = smp_processor_id();
        struct perf_event_attr *wd_attr;
        struct perf_event *evt;

        wd_attr = &wd_hw_attr;
        wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);

        /* Try to register using hardware perf events */
        evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
                                               watchdog_overflow_callback, NULL);
        if (IS_ERR(evt)) {
                pr_info("Perf event create on CPU %d failed with %ld\n", cpu,
                        PTR_ERR(evt));
                return PTR_ERR(evt);
        }
        this_cpu_write(watchdog_ev, evt);
        return 0;
}

/*
 * These functions can be overridden if an architecture implements its
 * own hardlockup detector.
 *
 * watchdog_nmi_enable/disable can be implemented to start and stop when
 * softlockup watchdog threads start and stop. The arch must select the
 * SOFTLOCKUP_DETECTOR Kconfig.
 */
int __weak watchdog_nmi_enable(unsigned int cpu)
{
        hardlockup_detector_perf_enable();
        return 0;
}

 

References

  1. Softlockup detector and hardlockup detector (aka nmi_watchdog)
  2. An introduction to soft lockup and hard lockup
  3. How does the kernel detect soft lockup and hard lockup?