Summary of Troubleshooting a Linux Kernel Crash

1. Overview

One day, the Linux kernels on several production servers of our project's distributed file system crashed, severely degrading the project's ability to serve external traffic and causing quite a stir inside the company. Investigating the production servers established that the outages were caused by kernel panics, and over two phases of troubleshooting we essentially pinned down what triggered them. The main troubleshooting techniques were researching the problem online, analyzing the kernel error logs, and constructing conditions to reproduce the crash. This document summarizes my own work over the whole investigation.

2. Phase One

When the problem first appeared everyone was under pressure and working late every night, and we drew up many plans for reproducing the problem and locating its cause. In phase one I spent two solid weeks analyzing it. Since nearly all of my phase-one work was already written up in detailed analysis documents, below I only summarize the measures I took and what they achieved.

I broke phase one into several steps. Naturally, the first idea was to reproduce the production symptoms, so I started by reading the production kernel error logs. According to the logs, the panics were triggered by the qmgr and master processes (at least that is what the log messages suggested); both are mail-delivery daemons. Combined with the symptoms at the time (high load, servers no longer serving external requests), my first move was to write a program that kept sending mail. When the program ran I got briefly excited: the load climbed and the machine's responsiveness degraded (ssh logins slowed), which was close to the production symptoms. But the kernel never panicked, no matter how much further I pushed the sending rate, and it logged no errors either. I gradually ruled this cause out.
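For reference, a minimal load generator along those lines might look like the sketch below. It is my own reconstruction of the idea, not the original program: it just pipes messages into the local sendmail binary in a loop, and the recipient address, message count, and sendmail path are placeholders rather than the values from the real test.

/* mailflood.c - crude mail-sending loop to drive qmgr/master.
 * Hypothetical sketch; recipient, count, and sendmail path are
 * placeholders. Build: gcc -o mailflood mailflood.c */
#include <stdio.h>

int main(void)
{
    FILE *p;
    long i;

    for (i = 0; i < 100000; i++) {
        /* -t: let sendmail pick the recipients up from the headers */
        p = popen("/usr/sbin/sendmail -t", "w");
        if (!p) {
            perror("popen");
            return 1;
        }
        fprintf(p, "To: test@localhost\n"
                   "Subject: flood %ld\n\n"
                   "load test body\n", i);
        pclose(p);   /* blocks until this sendmail instance exits */
    }
    return 0;
}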

Because all of the failed servers ran our distributed file system, people suspected the file system had triggered the panics. But the business monitoring showed nothing unusual for the file system in that time window, and the data volume was not large either. Still, I installed the distributed file system on several virtual machines and wrote a Java program that continuously wrote files into the cluster through the file system client, while also running the mail-sending program, to simulate the production environment as closely as possible. After many long runs the production symptoms never appeared, and I had no better means of reproducing them.
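The write-load half of that rig was the same idea: keep pushing files through the client. The original tool was in Java; the C sketch below shows the equivalent loop, with /mnt/dfs standing in as a made-up path for wherever the client exposes the cluster.

/* writeloop.c - keep writing files through the DFS client mount.
 * Hypothetical sketch of the idea (the original tool was Java);
 * /mnt/dfs is a placeholder. Build: gcc -o writeloop writeloop.c */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char path[256], buf[4096];
    FILE *f;
    long i;
    int j;

    memset(buf, 'x', sizeof(buf));
    for (i = 0; ; i++) {
        /* cycle over 1024 file names so the mount doesn't fill up */
        snprintf(path, sizeof(path), "/mnt/dfs/stress-%ld.dat", i % 1024);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return 1;
        }
        for (j = 0; j < 1024; j++)      /* 4 MB per file, 4 KB chunks */
            fwrite(buf, 1, sizeof(buf), f);
        fclose(f);
    }
    return 0;   /* not reached; stop with Ctrl-C */
}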

Since reproduction had failed, the only option left was to work out the cause directly from the kernel error messages. The steps were simple: first locate the code that produced the error, then analyze the surrounding code paths. The details of that analysis are in last year's documents.

Based on the code analysis and similar bugs reported online, the tentative conclusion was that the computed CPU scheduling clock had overflowed, causing the watchdog task to throw a panic and the kernel to hang. To verify that, I tried to construct the overflow condition by modifying the kernel: a kernel module that altered the counter behind the system's scheduling clock. The modification itself took effect, but unfortunately the kernel died outright, so reproducing the bug by patching the kernel failed as well.

Later I also consulted several kernel-savvy engineers outside the company. Based on the information we provided, they each offered an analysis, but nobody had a good way to reproduce the problem or a definitive root cause, and different people's conclusions differed considerably.

So after two to three weeks of continuous work on this problem, phase one ended without a definite answer.

3. Phase Two

A new year began, and on the very first day I got ready to chase this problem again. I set myself a simple plan: work on locating the kernel problem from 5 to 8 pm every day, learning the relevant kernel internals along the way.

This time I changed my angle of attack from the start. Last year I had analyzed the kernel error log of a single server, and that path was exhausted. So now I analyzed the logs of all the failed servers together (fortunately I had asked operations for every failed server's kernel log back then and kept them all, otherwise I would have been dead in the water) and looked for what they had in common. The first common point was the warning printed by the trace subsystem, "Delta way too big!...". According to the available information this warning does not by itself hang a Linux system, and indeed not all of our production servers had become unreachable over ssh. I did, however, find a similar bug with a resolution on Red Hat's site (https://access.redhat.com/knowledge/solutions/70051). The bug information and resolution are as follows:

why kernel is throwing "Delta way too big" out with "WARNING: at kernel/trace/ring_buffer.c:1988 rb_reserve_next_event+0x2ce/0x370" messages

Issue

•kernel is throwing "Delta way too big" out with kernel oops on server

Environment

•Red Hat Enterprise Linux 6 service pack 1

Resolution

The warning "Delta way too big" might appear on a system with an unstable sched clock right after the system is resumed, if tracing was enabled during the suspend.

Since it's not really a bug, and the unstable sched clock is otherwise working fast and reliably, we suggest keeping the sched clock in any case and just making a note of it in the warning itself, or disabling tracing with: echo 0 > /sys/kernel/debug/tracing/tracing_on

Root Cause

In this case ftrace was involved, due to the call to ftrace_raw_event_power_end (debugfs is mounted and ftrace loaded in this case); the messages have to do with problems calculating a time stamp for a ring buffer entry. The message comes from here and appears to indicate problems with time stability.
1966 static int
1967 rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
1968                   u64 *ts, u64 *delta)
1969 {
1970         struct ring_buffer_event *event;
1971         static int once;
1972         int ret;
1973
1974         if (unlikely(*delta > (1ULL << 59) && !once++)) {
1975                 int local_clock_stable = 1;
1976 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
1977                 local_clock_stable = sched_clock_stable;
1978 #endif
1979                 printk(KERN_WARNING "Delta way too big! %llu"
1980                        " ts=%llu write stamp = %llu\n%s",
1981                        (unsigned long long)*delta,
1982                        (unsigned long long)*ts,
1983                        (unsigned long long)cpu_buffer->write_stamp,
1984                        local_clock_stable ? "" :
1985                        "If you just came from a suspend/resume,\n"
1986                        "please switch to the trace global clock:\n"
1987                        "  echo global > /sys/kernel/debug/tracing/trace_clock\n");
1988                 WARN_ON(1);

This is called from rb_reserve_next_event(), here:
2122         /*
2123          * Only the first commit can update the timestamp.
2124          * Yes there is a race here. If an interrupt comes in
2125          * just after the conditional and it traces too, then it
2126          * will also check the deltas. More than one timestamp may
2127          * also be made. But only the entry that did the actual
2128          * commit will be something other than zero.
2129          */
2130         if (likely(cpu_buffer->tail_page == cpu_buffer->commit_page &&
2131                    rb_page_write(cpu_buffer->tail_page) ==
2132                    rb_commit_index(cpu_buffer))) {
2133                 u64 diff;
2134
2135                 diff = ts - cpu_buffer->write_stamp;
2136
2137                 /* make sure this diff is calculated here */
2138                 barrier();
2139
2140                 /* Did the write stamp get updated already? */
2141                 if (unlikely(ts < cpu_buffer->write_stamp))
2142                         goto get_event;
2143
2144                 delta = diff;
2145                 if (unlikely(test_time_stamp(delta))) {
2146
2147                         commit = rb_add_time_stamp(cpu_buffer, &ts, &delta); <---- HERE
This has to do with time stamping for ring buffer entries.

As the information above shows, this was exactly the same code and the same line of analysis as mine from last year; only about the root cause I could not be certain, since without a reproduction it remained an inference.

Next I analyzed the other error appearing in the kernel logs: "BUG: soft lockup - CPU#N stuck for 4278190091s! [qmgr/master:PID]". I have lightly normalized the message here: the N after CPU# is a concrete CPU number that differs per server, and the process name and PID in the brackets differ as well, but the process is always qmgr or master. The occurrences are tallied in the following table:

IP (last octet)    Time        Error log types and processes
107                13:01:20    1 qmgr
108                14:03:34    1 master, 2 qmgr
109                14:05:44    1 qmgr, 2 master
110                14:22:44    1 qmgr
111                14:19:58    1 qmgr, 2 master
112                14:17:12    1 qmgr, 2 master
113                14:22:49    1 qmgr, 2 master
114                14:19:58    1 qmgr, 2 master

Error type 1 is the warning discussed above, which does not hang the kernel; type 2 is the error analyzed here, which panics the Linux kernel. As the table shows, only 107 and 110 had not hung at the time.

Continuing with the kernel error logs, I noticed one striking common point: the value 4278190091s. First, what this value means: normally, if a CPU fails to feed the watchdog (i.e., the per-CPU watchdog task does not get to run) for more than 10 seconds, the kernel throws a soft lockup error and hangs. Yet here the value was 4278190091s, and it was identical on every server. That reads like a fixed, systematic value rather than a real measured duration (the sketch after the quoted article below shows how an overflowed, wrapped counter yields exactly this kind of absurd number). To verify the idea I searched Red Hat's site for this error message, and to my great excitement found the very same bug (https://access.redhat.com/knowledge/solutions/68466), whose affected Red Hat and kernel versions matched ours exactly (RHEL 6.2, to which CentOS 6.2 corresponds). The bug information and resolution are as follows:

Does Red Hat Enterprise Linux 6 or 5 have a reboot problem which is caused by sched_clock() overflow around 208.5 days?

Issue
•Linux Kernel panics when sched_clock() overflows after an uptime of around 208.5 days.
•Red Hat Enterprise Linux 6.1 system reboots with sched_clock() overflow after an uptime of around 208.5 days
•This symptom may happen on the systems using the CPU which has TSC.
•A process showing BUG: soft lockup - CPU#N stuck for 4278190091s!

Environment

•Red Hat Enterprise Linux 6
◦Red Hat Enterprise Linux 6.0, 6.1 and 6.2 are affected
◦several kernels affected, see below
◦TSC clock source - **see root cause
•Red Hat Enterprise Linux 5

◦Red Hat Enterprise Linux 5.3, 5.6, 5.8: please refer to the resolution section for affected kernels
◦Red Hat Enterprise Linux 5.0, 5.1, 5.2, 5.4, 5.5, 5.7: all kernels affected
◦Red Hat Enterprise Linux 5.9 and later are not affected
◦TSC clock source - **see root cause
•An approximate uptime of around 208.5 days.

Resolution

•Red Hat Enterprise Linux 6

◦Red Hat Enterprise Linux 6.x: update to kernel-2.6.32-279.el6 (from RHSA-2012-0862) or later. This kernel is already part of RHEL6.3GA. This fix was implemented with (private) bz765720.
◦Red Hat Enterprise Linux 6.2: update to kernel-2.6.32-220.4.2.el6 (from RHBA-2012-0124) or later. This fix was implemented with (private) bz781974.
◦Red Hat Enterprise Linux 6.1 Extended Update Support: update to kernel-2.6.32-131.26.1.el6 (from RHBA-2012-0424) or later. This fix was implemented with (private) bz795817.
•Red Hat Enterprise Linux 5

◦architecture x86_64/64bit

■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.
■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.11.1.el5 (from RHSA-2012-1061) or later.
■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.
■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.
◦architecture x86/32bit

■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.
■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.13.1.el5 (from RHSA-2012-1174) or later.
■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.
■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.

Root Cause
•An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function. This overflow led to a kernel panic or any other unpredictable trouble on the systems using the Time Stamp Counter (TSC) clock source.
•This problem will occur only when system uptime becomes 208.5 days or exceeds 208.5 days.
•This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under these circumstances.
•On Red Hat Enterprise 5, this problem is a timing issue and very very rare to happen.
•**Switching to another clocksource is usually not a workaround for most of customers as the TSC is a fast access clock whereas the HPET and PMTimer are both slow access clocks. Using notsc would be a significant performance hit.
Diagnostic Steps

Note: This issue could likely happen in numerous locations that deal with time in the kernel. For example, a user running a non-Red Hat kernel had the kernel panic with a soft lockup in __ticket_spin_lock.

The information above confirms that this is a Linux kernel bug, and its cause is described in detail there: on affected x86_64 kernel versions, the sched_clock() computation overflows once the uptime reaches about 208.5 days.
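Both magic numbers are easy to sanity-check. The sketch below is my own back-of-the-envelope verification, not code from the kernel or the Red Hat article, and the timestamps in its second half are invented for illustration. On the affected kernels the x86 sched_clock() path computes roughly ns = cycles * cyc2ns_scale >> 10 (CYC2NS_SCALE_FACTOR is 10), so the 64-bit product wraps once the uptime in nanoseconds reaches 2^54, about 208.5 days; and the soft lockup detector converts nanoseconds to rough seconds with a >>30 shift, so the last watchdog timestamp before the wrap sits near 2^24 seconds, and the unsigned subtraction "now - touch_ts" across the wrap underflows into a huge bogus duration.

/* overflow_check.c - sanity-check the 208.5-day figure and the bogus
 * soft lockup duration. My own sketch; the sample timestamps below
 * are invented. Build: gcc -o overflow_check overflow_check.c */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t overflow_ns;
    uint32_t touch_ts, now;

    /* x86 sched_clock(): ns = cycles * cyc2ns_scale >> 10, so the
     * 64-bit product (~ ns << 10) wraps once ns reaches 2^54. */
    overflow_ns = 1ULL << (64 - 10);
    printf("sched_clock() wraps after %.1f days\n",
           (double)overflow_ns / 1e9 / 86400.0);        /* ~208.5 days */

    /* ns >> 30 ~= seconds, so the watchdog's last timestamp before
     * the wrap is just shy of 2^24 "seconds"; after the wrap, the
     * unsigned subtraction underflows. */
    touch_ts = (1U << 24) - 11;   /* watchdog last fed here (made up) */
    now = 0;                      /* clock just wrapped */
    printf("bogus lockup duration: %us\n", now - touch_ts);
    return 0;
}

With these made-up inputs the second line prints exactly 4278190091s, the very value from our logs, which is why the identical figure on every server was such a strong hint that the duration was an overflow artifact rather than a measurement.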

Although the information above confirmed the cause of the kernel panic, I wanted to know whether Taobao's kernel engineers had run into the same problem, so I contacted a Taobao kernel engineer I had chatted with before over QQ. The result: they had hit the same error, could not reproduce it either, and their fix was likewise to upgrade the kernel.

4. Conclusion

At this point the problem can be considered confirmed. The troubleshooting process was arduous, but when you finally get the answer you were after, the excitement is hard to describe. It has nothing to do with promotions or raises; this, I suppose, is the joy of technology.
