Chapter 7. Linux System Crashes and Hangs

7.1. Introduction

One of Linux’s claims to fame is its stability and the infrequency of system crashes and hangs. Development versions of the kernel are less stable, but unfortunately, mainstream kernel versions will also sometimes crash or hang. The beauty of Linux is that when this happens, users can track the problem down to a failing line of source code and even fix it themselves if they’re so inclined. With proprietary operating systems, your only course of action is to contact the company or author(s) of the operating system and hope that they can help you. Anyone who has been through this knows that it can be the start of a lengthy and frustrating battle that is never guaranteed to end with a solution to the problem. With Linux, a full set of debugging and diagnostic tools and some knowledge about where to look leave you much better armed to seek and find a solution. The goal of this section is to discuss the many tools that Linux provides to get you well on your way to analyzing some of the most crucial operating system problems. We will discuss how to set up, configure, and use a serial console; how to read and understand a kernel Oops report; and how to determine the failing line of source code from it.

7.2. Gathering Information

Gathering information for analysis, whether by you or by a support group of kernel developers on the Internet, is the first step on the road to troubleshooting a serious system problem. There are two main types of serious system problems: crashes and hangs. A crash occurs when the kernel is aware of the problem it has just encountered and is able to do something about it before putting itself to sleep or rebooting. A hang occurs when there is a serious deadlock within the kernel that happens without warning and does not give the kernel the ability to do anything about it. Many of the tools for tracking the cause of each type of problem are the same; however, with a hang some of the diagnostic information may not be available, as the kernel didn’t have a chance to write it to disk or the screen.

7.2.1. Syslog Explained

The syslog is usually /var/log/messages, but it can be placed elsewhere by modifying values in /etc/syslog.conf. The syslog file is a text log of messages written by the syslog daemon, which reads the messages directly from kernel buffers. Monitoring this file regularly can often provide crucial hints about the general health of your system such as disk space running out, memory being exhausted, I/O errors, device failures, and so on. When restarting the system after a crash or hang, this file should be examined first to see if anything was logged that could give a hint as to what might have caused the problem.

When doing this, the recommended procedure is the following:

  1. Wait for the system to fully restart.
  2. Log in or su (switch user) to root.
  3. Examine /etc/syslog.conf to determine the system log filename. For example, look for something like the following:

     #
     # save the rest in one file
     #
     *.*;mail.none;news.none    -/var/log/messages

  4. Open /var/log/messages (or similar from Step 3) in vi, less, or your favorite editor.
  5. Navigate to the end of the file.
  6. Search backward for “restart.” You should see a line like this:

     Mar 14 19:45:21 linux syslogd 1.4.1: restart.

  7. The messages immediately prior to the line found in Step 6 are the last messages logged before the system restarted. Examine them for anything suspicious.
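
Steps 5 through 7 can also be automated. The following is a minimal sketch, not part of the original procedure: the function names find_last_restart and print_lines_before are hypothetical, and the code assumes the restart marker contains both “syslogd” and “restart” as in the sample line above, and that no log line exceeds 1023 bytes.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the 1-based number of the last line in fp that looks like a
 * syslogd restart marker, or 0 if there is none.
 * Assumption: every log line is shorter than the 1024-byte buffer.
 */
long find_last_restart(FILE *fp)
{
    char line[1024];
    long lineno = 0, last = 0;

    rewind(fp);
    while (fgets(line, sizeof line, fp) != NULL) {
        lineno++;
        if (strstr(line, "syslogd") != NULL && strstr(line, "restart") != NULL)
            last = lineno;
    }
    return last;
}

/*
 * Print up to `count` lines that immediately precede line `restart_line`.
 * These are the last messages logged before the system restarted.
 */
void print_lines_before(FILE *fp, long restart_line, long count)
{
    char line[1024];
    long lineno = 0;
    long first = restart_line - count;

    rewind(fp);
    while (fgets(line, sizeof line, fp) != NULL) {
        lineno++;
        if (lineno >= restart_line)
            break;
        if (lineno >= first)
            fputs(line, stdout);
    }
}
```

Opening the log file found in Step 3 with fopen, passing the stream to find_last_restart, and then calling print_lines_before with a count of, say, 20 shows the same messages you would otherwise locate by hand in an editor.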

Generally, individual messages are composed of the following sequence of tokens:

<timestamp> <hostname> <message origin> <message text>

 

An example showing a message coming from a kernel driver:

Mar 10 22:49:05 linux kernel: usb.c: deregistering driver serial

<timestamp> = Mar 10 22:49:05
<hostname> = linux
<message origin> = kernel: usb.c
<message text> = deregistering driver serial

We know from the message origin that the message came from the kernel, and we are also given the exact source file containing the message. We can now easily examine usb.c and search for the message text to see exactly where kernel execution was. The code in question looks like this:

/**
 *      usb_deregister - unregister a USB driver
 *      @driver: USB operations of the driver to unregister
 *
 *      Unlinks the specified driver from the internal USB driver list.
 */
void usb_deregister(struct usb_driver *driver)
{
        struct list_head *tmp;

        info("deregistering driver %s", driver->name);
        if (driver->fops != NULL)
                usb_minors[driver->minor/16] = NULL;

Knowing exactly what this code does is not important—it is important at this time only to know that by a simple log message we can determine exactly what was being executed, which can be invaluable in problem determination. One thing to note, though, is that info is a macro defined as:

#define info(format, arg...) printk(KERN_INFO __FILE__ ": " format "\n" , ## arg)

 

As you can see, printk is the key function that performs the act of writing the message to the kernel message buffer. We can also see that the standard C macro __FILE__ is used to dump the source filename.
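
To see how the pieces of the macro combine, here is a user-space sketch. It is an illustration only, not kernel code: snprintf stands in for printk, the KERN_INFO prefix is dropped, and the names info_to_buf and log_deregister are hypothetical. The `##__VA_ARGS__` form is a GNU extension accepted by gcc and clang.

```c
#include <stdio.h>

/*
 * User-space stand-in for the kernel's info() macro.
 * Adjacent string literals are concatenated by the compiler, so
 * __FILE__ ": " format "\n" becomes one format string that starts
 * with the name of the source file being compiled.
 */
#define info_to_buf(buf, size, format, ...) \
    snprintf((buf), (size), __FILE__ ": " format "\n", ##__VA_ARGS__)

/* Format the same message that usb_deregister() logs, into buf. */
int log_deregister(char *buf, size_t size, const char *driver_name)
{
    return info_to_buf(buf, size, "deregistering driver %s", driver_name);
}
```

Calling log_deregister with the driver name "serial" from a file named usb.c would produce the text "usb.c: deregistering driver serial" followed by a newline, which matches the message origin and message text seen in the syslog entry.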

There is, however, no guarantee that anything will appear in the syslog. Often, even in the case of a crash, the klogd system logger is unable to write the information to disk. This is where a serial console becomes important. When properly configured, the kernel will send important log messages to the serial console as well as to the buffers where the syslog daemon picks them up. After the messages are sent over the serial line, the remote console will receive the messages and preserve them there.

7.2.2. Setting up a Serial Console

Possibly one of the most important tools in diagnosing and determining the cause of system crashes or hangs is the use of a serial console to gather diagnostic information. A serial console is not something that every system requires, although it doesn’t hurt to have. But if your system inexplicably crashes or hangs more than a few times in a very short period of time, it is highly recommended.

Note: In the kernel source package included with most distributions, the file Documentation/serial-console.txt is an excellent guide on setting up and configuring a serial console.

A useful companion to the serial console is the kernel magic SysRq key; refer to the sysrq section of Chapter 3, “The /proc Filesystem,” for more information. A serial console helps because when a system enters a panic state, for example in the case of a kernel oops, the kernel dumps information to the kernel log daemon. This normally means that the information gets written to /var/log/messages; however, there are cases where the system is unable to perform the writes to disk, and this is where the serial console proves most useful. When properly set up, the information is dumped over the serial port as well, so the remote system, which is in a healthy state, will receive and save this information. This information can then be analyzed using the techniques discussed in this section or forwarded to the appropriate support group.

7.2.3. Connecting the Serial Null-Modem Cable

The first thing to do is obtain a serial null-modem cable. These can commonly be found at any computer store and generally sell for a minimal amount. You should also check the external serial ports on both computers to determine whether you require 9 or 25 pin connectors. Newer null-modem cables are sold with both 9 and 25 pin connectors on each end, so it may be desirable to purchase this kind.

Once the cable is in place, it should be tested to ensure that data can be sent from one machine to the other. Do this by first starting a communications program on the serial console of a separate system. If the machine is running Linux, minicom is a good choice, and if Windows is running, HyperTerminal is also fine.

Note: minicom may not be installed on your Linux system by default. The executable is usually /usr/bin/minicom. If you do not have it installed, most distributions include it as an optionally installed package that can be installed at any time.

Generally, the default communications settings will suffice. Next, on the source machine, run the following as root, assuming the null-modem cable is connected to the first serial port on the computer (/dev/ttyS0):

stty speed 38400 < /dev/ttyS0 ; echo 'This should appear on the remote machine' > /dev/ttyS0

 

The message “This should appear on the remote machine” should appear in the communications program on the serial console. If it does not, check the following:

  1. The cable is in fact a null-modem cable.
  2. The cable is connected to the first serial port on the server; if it isn’t, change /dev/ttyS0 to /dev/ttyS1 and try again.
  3. The serial console communication program is listening on the correct serial port.
  4. The speed is set to 38400 on the serial console in the communications program.
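
The same cable test can be run from a small program instead of stty and echo. The sketch below is an assumption-laden illustration rather than part of the original procedure: the function name send_serial_test is hypothetical, and the code relies only on the standard POSIX termios calls to select 38400 baud before writing the message.

```c
#include <fcntl.h>
#include <string.h>
#include <termios.h>
#include <unistd.h>

/*
 * Set the given serial device to 38400 baud and write msg followed by
 * a newline. Returns 0 on success and -1 on any failure, for example
 * when the device does not exist or is not a terminal.
 */
int send_serial_test(const char *device, const char *msg)
{
    struct termios tio;
    size_t len = strlen(msg);
    int ok;
    int fd = open(device, O_WRONLY | O_NOCTTY);

    if (fd < 0)
        return -1;

    /* Read the current settings, change only the speed, write them back. */
    if (tcgetattr(fd, &tio) < 0 ||
        cfsetispeed(&tio, B38400) < 0 ||
        cfsetospeed(&tio, B38400) < 0 ||
        tcsetattr(fd, TCSANOW, &tio) < 0) {
        close(fd);
        return -1;
    }

    ok = write(fd, msg, len) == (ssize_t)len && write(fd, "\n", 1) == 1;
    close(fd);
    return ok ? 0 : -1;
}
```

Run as root, a call such as send_serial_test("/dev/ttyS0", "This should appear on the remote machine") should make the text show up in the communications program on the other machine; a return value of -1 points back to the checklist above.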

7.2.4. Enabling the Serial Console at Startup

When you’ve verified that the serial console works, the next step is to configure Linux to send important messages over the serial connection. This is generally done by booting with an additional kernel boot parameter. The boot parameter can be typed in manually at boot time with most boot loaders, or it can be added permanently to the boot loader’s configuration file. The additional parameter should look like this:

console=ttyS0,38400

 

Note that this shouldn’t replace any existing console= parameter but should be inserted before them instead. It is important to maintain any existing console=tty parameters so as not to render the virtual consoles unusable. For my system, I use GRUB, and here’s my menu.lst entry to enable the serial console:

title Linux
kernel (hd0,7)/boot/vmlinuz-2.4.21-99-default root=/dev/hda8 vga=0x314 splash=silent desktop hdc=ide-scsi hdclun=0 showopts console=/dev/ttyS0 console=/dev/tty0
initrd (hd0,7)/boot/initrd-2.4.21-99-default

After rebooting with this entry, the serial console should be set up, and some of the boot messages should appear. Note that not all boot messages appear on the serial console. When the server is booted up, be sure to enable the logging or capture feature in the communications program on the serial console to save all messages sent to it. For your reference, here’s an example of the kind of log captured in a serial console that would be sent to a distribution’s support team or a kernel developer (in this particular case, VMware, Inc. may also need to be contacted because the process name is vmware-vmx, but note that this does not in any way mean that there is a problem with this program):

Unable to handle kernel NULL pointer dereference at virtual address 000005e8
 printing eip: c429aa52
*pde = 00000000
Oops: 0002 2.4.21-99-default #1 Wed Sep 24 13:30:51 UTC 2003
CPU:    0
EIP:    0010:[usb-uhci:uhci_device_operations+31708122/24331932]  Tainted: PF
EIP:    0010:[<c429aa52>]   Tainted: PF
EFLAGS: 00213246
eax: 00000000   ebx: 00000001  ecx: c36a4720  edx: 00000001
esi: 00000000   edi: 00000000  ebp: c7f7fe68  esp: c7f7fe50
ds: 0018   es: 0018   ss: 0018
Process vmware-vmx (pid: 2808, stackpage=c7f7f000)
Stack: 00000000 c7f7fec8 42826000 c0ed8860 c8dc7520 ffffffea c7f7ff88 c429886c
       cdf9ce00 00000000 00000000 00000000 c036a2e0 c0121ce2 c0121bc9 00000000
       00000001 c0121992 00003046 00003046 00000001 00000000 c02e0054 c7f7fec8
Call Trace: [usb-uhci:uhci_device_operations+31699444/24340610] [bh_action+66/80] [tasklet_hi_action+57/112] [do_softirq+98/224] [do_IRQ+156/176]
Call Trace: [<c429886c>] [<c0121ce2>] [<c0121bc9>] [<c0121992>] [<c010a1dc>]
  [call_do_IRQ+5/13] [__do_mmap_pgoff+1361/1632] [__do_mmap_pgoff+1419/1632] [__do_mmap2+88/176] [__do_mmap2+119/176] [sys_ioctl+470/618]
  [<c010c4d8>] [<c0131ab1>] [<c0131aeb>] [<c010e8d8>] [<c010e8f7>] [<c0153526>]
  [sys_mmap2+35/48] [system_call+51/64]
  [<c010e993>] [<c0108dd3>]
Modules: [(vmmon:<c4298060>:<c429defc>)]
Code: 89 9e e8 05 00 00 50 50 8b 45 0c 50 57 e8 ea 0c 00 00 83 c4
<3>sr0: CDROM (ioctl) reports ILLEGAL REQUEST. spurious 8259A interrupt: IRQ7.

7.2.5. Using SysRq Kernel Magic

The magic SysRq hotkey may make it possible to communicate with a panicked kernel and have it dump information such as stack tracebacks of running tasks, the current program counter (PC) location, memory status, and so on. Refer to the /proc/sys/kernel/sysrq section in Chapter 3 for a detailed discussion of how to make use of this feature.

7.2.6. Oops Reports

An Oops Report is a dump of information produced by the kernel when it encounters a serious problem. The problem can be a code-related bug such as dereferencing a NULL pointer, accessing out-of-bounds memory, and so on. The Oops Report is generated by the kernel to help the end user debug, locate, and fix the problem. Sometimes when an oops occurs, the system may seem to continue running normally, but it is likely to be in an unstable state. It is a good idea to save all your work and reboot as soon as possible.

To demonstrate a real live kernel oops, we modified the kernel source to allow a user to trap the kernel at will. We discuss how we did this in the section, “Adding a Manual Kernel Trap,” which may be skipped if you are not interested in the somewhat simple modifications we made to the kernel. In the sections that follow it, “Examining an Oops Report” and “Determining the Failing Line of Code,” we will discuss the Oops Report generated by the manual kernel trap in detail. We will also illustrate how to find the exact line of source code that caused the kernel oops solely from the Oops Report, so you may wish to read the “Adding a Manual Kernel Trap” section after reading the other two sections.

7.2.7. Adding a Manual Kernel Trap

For the purposes of easily demonstrating a kernel oops and how to examine the resulting information, we modified the kernel source code to add an interface in the /proc filesystem, which root could manipulate to force a trap in the kernel. We used the 2.6.2 kernel source downloaded directly from ftp.kernel.org on an AMD64 machine. Describing kernel source code in detail is beyond the scope of this book, but we’re including details on what we did for the curious reader who may be able to use this example as a very basic primer on how to get started with the kernel source.

First, we decided on the interface we wanted to use. We decided to create a new file in the /proc filesystem called “trap_kernel.” The most logical place for it is in /proc/sys/kernel, as entries in this directory are very kernel-specific. Next, we needed to find where in the kernel source the addition of this new file would happen. Using the /proc/sys/kernel/sysrq file as an example, we located the source in kernel/sysctl.c[1]. When editing this file, we first needed to add a global variable to store the value that /proc/sys/kernel/trap_kernel represents. This was simply a matter of adding the following, with the default value of 0, to the global declaration scope of the file:

[1] When referring to kernel source files, it’s common to have the pathname start at the /usr/src/linux directory. For example /usr/src/linux/kernel/sysctl.c would be commonly referred to as kernel/sysctl.c.

int trap_kernel_value = 0;

 

Next we needed a new structure that contained information for our new file. The code that we added to the kern_table array of structures is the trap_kernel entry in the following excerpt:

        {
                .ctl_name       = KERN_PRINTK_RATELIMIT_BURST,
                .procname       = "printk_ratelimit_burst",
                .data           = &printk_ratelimit_burst,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = &proc_dointvec,
        },
        /* New entry added for /proc/sys/kernel/trap_kernel */
        {
                .ctl_name       = KERN_TRAP_KERNEL,
                .procname       = "trap_kernel",
                .data           = &trap_kernel_value,
                .maxlen         = sizeof (int),
                .mode           = 0644,
                .proc_handler   = &proc_dointvec_trap_kernel,
        },
        { .ctl_name = 0 }
};

 

As is shown, we set the proc_handler to proc_dointvec_trap_kernel, which is basically a customized version of the real proc_dointvec. Without going into too much detail, proc_dointvec is used to handle user manipulation of an integer-based /proc file entry. /proc/sys/kernel/sysrq and /proc/sys/kernel/shmmni are examples of interfaces that work with integers. proc_dointvec is a “wrapper” function, which simply calls do_proc_dointvec with customized parameters:

int proc_dointvec(ctl_table *table, int write, struct file *filp,
                  void __user *buffer, size_t *lenp)
{
    return do_proc_dointvec(table,write,filp,buffer,lenp,
                            NULL,NULL);
}

The next step was to add the code for our proc_dointvec_trap_kernel customized handler, as follows:

int proc_dointvec_trap_kernel( ctl_table *table, int write,
                               struct file *filp,
                               void __user *buffer, size_t *lenp )
{
    char c;

    if ( write )
    {
       /* Copy the first character written by the user into c. */
       if ( ( get_user( c, (char *)buffer ) ) == 0 )
       {
          if ( c == '9' )
          {
             printk( KERN_ERR "trap_kernel: got '9'; trapping the kernel now\n" );
             char *trap = NULL;
             *trap = 1;   /* deliberate NULL pointer dereference */
          }
          else
          {
             printk( KERN_ERR "trap_kernel: ignoring '%c'\n", c );
          }
       }
       else
       {
          printk( KERN_ERR "trap_kernel: problem getting value\n" );
       }
    }
    /* Reads, and writes that did not trap, are handled as usual. */
    return do_proc_dointvec(table,write,filp,buffer,lenp,NULL,NULL);
}

The idea is that before calling do_proc_dointvec, we added some logic to determine if the user really is requesting a kernel trap. The logic is first to check the write flag, which is set to 1 if a write is being performed by the user, with, for example, the following command:

linux> echo 1 > /proc/sys/kernel/trap_kernel

 

The write flag is set to 0 if a command such as this is being run:

linux> cat /proc/sys/kernel/trap_kernel

 

If a write is not being performed, nothing extra is done, and the real do_proc_dointvec function is called. The kernel carries on normally and returns the value of the trap_kernel_value global variable to the user. If a write is being performed, the first character of the buffer, which is passed in as a void pointer to user-space memory, is copied into the char c with get_user so that it can be examined. If the copy is successful (that is, get_user returns 0), we check to see if the character is '9', which we chose as the trigger for trapping the kernel. If the char is a 9, we log a message indicating that the kernel will be trapped and then perform a simple NULL pointer dereference.

If the char is not 9, we log the fact that we’re ignoring that value in terms of trapping the system and allow the kernel to perform the write and carry on normally.

So if the user performs the following:

linux> echo 3 > /proc/sys/kernel/trap_kernel

 

the action will be performed without a problem, and an entry in /var/log/messages such as the following should appear:

Feb 12 11:27:40 linux kernel: trap_kernel: ignoring '3'

 

If the user executes this command:

linux> echo 9 > /proc/sys/kernel/trap_kernel

 

an Oops should immediately happen, which we will discuss in the next section.
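
The decision made by the handler described above can be modeled in user space, which makes it easy to see that only the first character written matters. This is a simplified sketch and not the kernel code itself: the enum trap_action and the function classify_write are hypothetical names, and the get_user failure path is left out.

```c
/* Possible outcomes of a call to the trap_kernel handler. */
enum trap_action {
    TRAP_NONE,    /* read access: nothing extra happens        */
    TRAP_IGNORE,  /* write whose first character is not '9'    */
    TRAP_FIRE     /* write whose first character is '9'        */
};

/*
 * User-space model of the check in proc_dointvec_trap_kernel.
 * `write` is the write flag and `buf` is the data the user wrote.
 * Only buf[0] is inspected, exactly as in the handler.
 */
enum trap_action classify_write(int write, const char *buf)
{
    if (!write)
        return TRAP_NONE;
    return buf[0] == '9' ? TRAP_FIRE : TRAP_IGNORE;
}
```

Because echo appends a newline, `echo 9` delivers the two characters '9' and '\n', and the model returns TRAP_FIRE. By the same logic, writing 91 would also trap, while writing 19 would be ignored.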

7.2.8. Examining an Oops Report

Examining an Oops Report is not an exact science and requires a bit of ingenuity and experience. To know the basic steps to take and to understand how things generally work is a great start and is the goal of this section. The oops dumps are a little different between 2.4.x and 2.6.x kernels.

7.2.8.1. 2.6.x Kernel Oops Dumps

Oops dumps in a 2.6.x kernel can usually be examined as is without the need to process them through ksymoops, as is needed for 2.4.x oops. The /proc/sys/kernel/trap_kernel “feature” we added to manually force a kernel oops results in the oops dump shown as follows:

Oops: 0002 [1]
CPU 0
Pid: 2680, comm: bash Not tainted
RIP: 0010:[<ffffffff8013ae64>] <ffffffff8013ae64>{proc_dointvec_trap_kernel+84}
RSP: 0018:00000100111c9ea8  EFLAGS: 00010216
RAX: 0000000000000031 RBX: 00000100111c8000 RCX: ffffffff80413210
RDX: 0000010011cd9280 RSI: ffffffff80413970 RDI: 0000000000000000
RBP: 0000002a95e59000 R08: 0000000000000033 R09: 000001001f57dbc0
R10: 0000000000000000 R11: 0000000000000175 R12: 0000000000000001
R13: 00000100111c9ee8 R14: 000001001115b980 R15: ffffffff803a1390
FS:  00000000005614a0(0000) GS:ffffffff8045bc40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006a0
Process bash (pid: 2680, stackpage=10011cda280)
Stack: 0000002a95e59000 0000000000000002 ffffffff803a1390 000001001115b980
       0000000000000002 0000000000000001 0000002a95e59000 ffffffff8013a6e8
       0000000000000002 0000000000000000
Call Trace:<ffffffff8013a6e8>{do_rw_proc+168} <ffffffff8016baf4>{vfs_write+228}
       <ffffffff8016bc09>{sys_write+73} <ffffffff80111830>{system_call+124}

Code: c6 04 25 00 00 00 00 01 eb 23 66 90 0f be f2 48 c7 c7 c0 9f
RIP <ffffffff8013ae64>{proc_dointvec_trap_kernel+84} RSP <00000100111c9ea8>
CR2: 0000000000000000

 

The dump was taken directly from /var/log/messages, and we manually removed the preceding <timestamp> <hostname> kernel: markings on each line for easier reading.

If you didn’t skip the “Adding a Manual Kernel Trap” section, the problem shown by the Oops Report will probably seem pretty obvious. But let’s pretend that we have no idea where this trap came from and it’s something that needs to be fixed.

Let’s first analyze the first line:

Oops: 0002 [1]

 

Initially, this looks really cryptic and useless, but it actually contains a great deal of information! To begin with, we know that we’re in an oops situation, meaning that the kernel has encountered an unexpected problem. The “0002” that follows “Oops:” is a hexadecimal number that represents the page fault error code. By decoding this, we can determine exactly what the error condition was. To decode it, we first need to convert it to binary—hexadecimal 2 is binary 10. Now we need to compare this value against Table 7.1 to decode the meaning.

Table 7.1. Page Fault Error Codes.

Bit     Value 0                               Value 1
0       No page found                         Protection fault
1       Read                                  Write
2       Kernel-mode                           User-mode
3[2]    Fault was not an instruction fetch    Fault was an instruction fetch

[2] Bit 3 is defined on the x86-64 architecture but not on the i386 architecture.

Using Table 7.1, we know that binary 10 means that no page was found, a write was attempted in kernel-mode, and the fault was not due to an instruction fetch (remember that this Oops Report was taken from an AMD64 machine). So from this information we know that a page was not found when doing a write operation within the kernel.
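
The decoding just performed by hand can be expressed in a few lines of C. This is a minimal sketch that follows Table 7.1 as printed in this section; the struct pf_error and the functions decode_pf_error and print_pf_error are hypothetical names introduced for illustration and are not kernel interfaces.

```c
#include <stdio.h>

/* One flag per bit of the page fault error code, following Table 7.1. */
struct pf_error {
    int protection;   /* bit 0: 0 = no page found, 1 = protection fault */
    int write;        /* bit 1: 0 = read,          1 = write            */
    int user_mode;    /* bit 2: 0 = kernel-mode,   1 = user-mode        */
    int instr_fetch;  /* bit 3: 1 = fault was an instruction fetch      */
};

/* Split the hexadecimal code printed after "Oops:" into its bits. */
struct pf_error decode_pf_error(unsigned long code)
{
    struct pf_error e;

    e.protection  = (code >> 0) & 1;
    e.write       = (code >> 1) & 1;
    e.user_mode   = (code >> 2) & 1;
    e.instr_fetch = (code >> 3) & 1;
    return e;
}

/* Print a one-line, human-readable summary of the error code. */
void print_pf_error(unsigned long code)
{
    struct pf_error e = decode_pf_error(code);

    printf("Oops code %04lx: %s, %s, %s, %s\n", code,
           e.protection  ? "protection fault" : "no page found",
           e.write       ? "write" : "read",
           e.user_mode   ? "user-mode" : "kernel-mode",
           e.instr_fetch ? "instruction fetch" : "not an instruction fetch");
}
```

For the value in our report, decode_pf_error(0x0002) sets only the write flag, which corresponds to a write in kernel mode to a page that was not found.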

The next piece of information on the first line is the [1]. This is the die counter that basically keeps track of the number of oopses that have occurred since the last reboot. In our case, this is the first oops we’ve encountered.

The next line is CPU 0 and indicates which CPU performed the instruction that caused the fault. In this case, my system only has one CPU, so 0 is the only possible value. The next line is:

Pid: 2680, comm: bash Not tainted

 

The Pid indicates which user-land process ID initiated the problem, and comm: tells us that the process name was bash. This makes sense because the command I issued was redirecting the output of an echo command, which is all handled by the shell. Not tainted tells us that our kernel has not been tainted by any modules not under the GPL and/or forcefully loaded. The next line in the oops dump is:

RIP: 0010:[<ffffffff8013ae64>] <ffffffff8013ae64>{proc_dointvec_trap_kernel+84}

 

RIP is the name of the instruction pointer on the AMD64 architecture. On 32-bit x86 systems, it is called the EIP instead and, of course, is 32 bits long rather than 64 bits long. (Though it somewhat fits in this situation to have RIP stand for Rest In Peace, this is not the intended meaning here.) The “0010” is a dumping of the CS register. The dumping of the CS register shows us that the current privilege level (CPL) was 0. This number corresponds to which permission or ring level the trap occurred in.

Note: Ring level is a term used to refer to what permissions code has when running on a CPU. Ring 0 has unlimited access, therefore kernel mode runs at ring 0. Ring level 3 is for user mode processes. On Linux, ring levels 1 and 2 are unused.
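
The privilege level can be read off the dumped CS value mechanically: on x86, the two low-order bits of a segment selector hold the privilege level. The helper below is a small sketch with a hypothetical name, privilege_level, added here for illustration.

```c
/*
 * Return the privilege (ring) level encoded in an x86 segment selector.
 * The level occupies the two low-order bits of the selector.
 */
unsigned int privilege_level(unsigned int selector)
{
    return selector & 0x3;
}
```

For the CS value 0x0010 in our report, privilege_level returns 0, which confirms that the trap occurred in kernel mode. A selector whose two low bits are both set, such as 0x0033, would yield ring level 3.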

The [<ffffffff8013ae64>] is a dump of the RIP register. This means that the instruction pointer was pointing to the instruction at memory address 0xffffffff8013ae64 at the time the trap occurred. Note that the RIP address is printed out a second time by another function, which the kernel calls to perform more work.

The {proc_dointvec_trap_kernel+84} is the result of the kernel translating the RIP address into a more human readable format. proc_dointvec_trap_kernel is the name of the function in which the RIP lies. The +84 means that the RIP is at an offset of decimal 84 into the proc_dointvec_trap_kernel function. This value is extremely useful in determining exactly what caused the trap, as we shall see in the “Determining the Failing Line of Code” section.
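
Given the address and the symbolic form, the start address of the function follows by subtraction. The sketch below parses the text exactly as it appears in the report; struct oops_symbol, parse_oops_symbol, and function_start are hypothetical names, and the parser assumes the `<address>{name+offset}` layout shown above with a decimal offset.

```c
#include <stdio.h>

/* The pieces of an "<address>{function+offset}" entry in an Oops Report. */
struct oops_symbol {
    unsigned long long address;  /* address of the faulting instruction */
    char name[64];               /* function containing that address    */
    unsigned int offset;         /* decimal offset into the function    */
};

/* Parse one entry. Returns 0 on success, -1 if the text does not match. */
int parse_oops_symbol(const char *text, struct oops_symbol *sym)
{
    int n = sscanf(text, "<%llx>{%63[^+]+%u}",
                   &sym->address, sym->name, &sym->offset);

    return n == 3 ? 0 : -1;
}

/* The function begins `offset` bytes before the reported address. */
unsigned long long function_start(const struct oops_symbol *sym)
{
    return sym->address - sym->offset;
}
```

For `<ffffffff8013ae64>{proc_dointvec_trap_kernel+84}`, the offset of decimal 84 is hexadecimal 0x54, so function_start yields 0xffffffff8013ae10 as the first instruction of proc_dointvec_trap_kernel.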

{proc_dointvec_trap_kernel+84} 是内核将 RIP 地址转换为更人性化的可读格式。proc_dointvec_trap_kernel 是 RIP 所在函数的名称。+84 表示RIP在 proc_dointvec_trap_kernel 函数的十进制84的偏移地址。此值在确定导致trap的确切原因方面非常有用, 我们将在 "确定失败代码行" 一节中看到这一点。

The next several lines are a dumping of the registers and their contents:

RSP: 0018:00000100111c9ea8 EFLAGS: 00010216
RAX: 0000000000000031 RBX: 00000100111c8000 RCX: ffffffff80413210
RDX: 0000010011cd9280 RSI: ffffffff80413970 RDI: 0000000000000000
RBP: 0000002a95e59000 R08: 0000000000000033 R09: 000001001f57dbc0
R10: 0000000000000000 R11: 0000000000000175 R12: 0000000000000001
R13: 00000100111c9ee8 R14: 000001001115b980 R15: ffffffff803a1390
FS:  00000000005614a0(0000) GS:ffffffff8045bc40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006a0

 

Describing each register and its use in detail is beyond the scope of this book[3]. For now, it is enough to say that the values stored in the registers can be very useful when examining the assembly code surrounding the trapping instruction.

[3] For detailed information on the registers, I recommend reading the AMD64 Architecture Manuals found at AMD’s Web site. For other architectures, manuals describing the hardware in detail can usually be found at the vendor’s Web site.

The next line of interest is:

Process bash (pid: 2680, stackpage=10011cda280)

 

The “Process bash (pid: 2680” portion repeats what was dumped on the third line. stackpage=10011cda280 shows us the kernel stack page that is involved in this process.

The next few lines dump a predefined number of 64-bit words from the stack. On AMD64, this number is set to 10.

Stack: 0000002a95e59000 0000000000000002 ffffffff803a1390 000001001115b980
       0000000000000002 0000000000000001 0000002a95e59000 ffffffff8013a6e8
       0000000000000002 0000000000000000

 

At first glance, these values are not very important. Depending on the assembly instructions surrounding the trap and the context of the particular problem, they may be needed, so they are dumped here just in case. The next lines of interest in the oops dump are:

Call Trace:<ffffffff8013a6e8>{do_rw_proc+168} <ffffffff8016baf4>{vfs_write+228}
       <ffffffff8016bc09>{sys_write+73} <ffffffff80111830>{system_call+124}

 

Call Trace lists the last few functions that were called before the trap occurred, most recent first. Reading the entries from last to first, we now know that the execution path was system_call, sys_write, vfs_write, do_rw_proc, and finally proc_dointvec_trap_kernel, where the trap occurred. This path is shown in Figure 7-1.

Figure 7-1. Call trace in kernel.
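
When an oops has been saved to a file, the function names in the call trace can be pulled out mechanically. The following is a minimal sketch that assumes the {function+offset} notation shown above; trace_symbols is a hypothetical helper name, and other kernel versions format the trace differently.

```shell
# Print one "function+offset" entry per line from oops text on standard input.
trace_symbols() {
    # Put each whitespace-separated field on its own line, then keep
    # only the text between the braces.
    tr -s ' \t' '\n' | sed -n 's/.*{\([^}]*\)}.*/\1/p'
}

trace_symbols <<'EOF'
Call Trace:<ffffffff8013a6e8>{do_rw_proc+168} <ffffffff8016baf4>{vfs_write+228}
       <ffffffff8016bc09>{sys_write+73} <ffffffff80111830>{system_call+124}
EOF
```

The output lists do_rw_proc+168 first and system_call+124 last, which is the same most-recent-first order as the trace itself.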


 


7.2.9. Determining the Failing Line of Code

Now that we know a bit more about the trap and its characteristics, we need to know what actually caused it. In almost all cases, the cause is a programming error, so the key to answering that question is first knowing where in the source code the problem occurs. We know the function name, and we know the offset into that function, so the first step is to find where the function proc_dointvec_trap_kernel is defined. Rather than scouring the hundreds of source files that make up the Linux kernel, it is far easier to use a tool called cscope (see the “Setting up cscope to Index Kernel Sources” section for more information). Entering the trapping function name at the “Find this C symbol:” prompt gives the screen shown in Figure 7.2.

Figure 7.2. cscope of kernel code.


 

If you read the “Adding a Manual Kernel Trap” section, the results just shown will be very familiar. Typing “0,” “1,” or “2” loads the respective file and places the cursor directly on the proc_dointvec_trap_kernel symbol. For now, we just wanted to find out which source file the function appears in. We now know that it is /usr/src/linux-2.6.2-2/kernel/sysctl.c.

Now we need to translate the decimal offset of 84 into the function into a source line of that function in sysctl.c.

First we have to get debug symbols built into the sysctl.o object. For the 2.6.x kernel source, do this with the following set of steps.

1. cd /usr/src/linux-2.6.2
   Everything we do here must be run from the top level of the kernel source tree.

2. rm kernel/sysctl.o
   Removing this file forces it to be recompiled.

3. export CONFIG_DEBUG_INFO=1
   Searching /usr/src/linux-2.6.2/Makefile for CONFIG_DEBUG_INFO shows that if this variable is set, the -g compile flag is added to the compilation.

4. make kernel/sysctl.o
   This recompiles the sysctl.c file with the -g flag included.

5. If you compare the size of sysctl.o before and after performing these steps, you will notice that it is now much larger. This is expected because it now contains debugging information.

6. Now for the critical part: converting the object file (sysctl.o) into a listing of assembly instructions mixed with the C source code. This is accomplished with the objdump command. Run it like this:

   objdump -d -S kernel/sysctl.o > kernel/sysctl.dump

Note: As documented in the objdump(1) man page, the -d option disassembles the object, and the -S option intermixes high-level source code with the assembly where possible.

 

We’re now ready to examine the assembly/source dump file and search for our decimal 84 offset. First open the sysctl.dump file created by the objdump command and search for the beginning of the proc_dointvec_trap_kernel function. It will look something like the following:

0000000000000fb0 <proc_dointvec_trap_kernel>:

int proc_dointvec_trap_kernel(ctl_table *table, int write, struct file *filp,
                     void __user *buffer, size_t *lenp)
{
     fb0:       48 83 ec 38             sub    $0x38,%rsp
    char c;

    if ( write )
     fb4:       85 f6                   test   %esi,%esi
     fb6:       48 89 6c 24 10          mov    %rbp,0x10(%rsp,1)
     fbb:       4c 89 64 24 18          mov    %r12,0x18(%rsp,1)
     fc0:       4c 89 6c 24 20          mov    %r13,0x20(%rsp,1)
     fc5:       4c 89 74 24 28          mov    %r14,0x28(%rsp,1)

 

As shown here, the function starts at hexadecimal offset fb0. Because offsets in the listing are given in hex, we need to convert the decimal 84 to hex, which is 0x54. Next we add 0x54 to 0xfb0 to find the location reported in the oops dump. The offset we need to look for in sysctl.dump is therefore 0x1004.

Tip: A quick way of doing calculations on a Linux machine is to run “gdb” and enter commands such as print 324 * 434. To do hex calculations, use a command such as print /x 0x54 + 0xfb0. Also note that print can be shortened to p.
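
The same arithmetic can also be done directly in the shell, which is convenient when gdb is not installed. This is a small sketch; oops_address is a hypothetical helper name, and its arguments are the function's start offset from the objdump listing and the decimal offset from the oops.

```shell
# Add the decimal offset from an oops to a function's hex start offset.
oops_address() {
    printf '0x%x\n' $(( $1 + $2 ))
}

oops_address 0xfb0 84    # prints: 0x1004
```

Shell arithmetic accepts the 0x prefix for hexadecimal constants, so the start offset and the decimal offset can be mixed freely.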

 

Offset 0x1004 in sysctl.dump is shown as follows:

   {
      if ( ( get_user( c, (char *)buffer ) ) == 0 )
     fe5:       48 89 c8               mov    %rcx,%rax
     fe8:       e8 00 00 00 00         callq  fed <proc_dointvec_trap_kernel+0x3d>
     fed:       85 c0                  test   %eax,%eax
     fef:       75 32                  jne    1023 <proc_dointvec_trap_kernel+0x73>
       {
          if ( c == '9' )
     ff1:       80 fa 39               cmp    $0x39,%dl
     ff4:       75 1a                  jne    1010 <proc_dointvec_trap_kernel+0x60>
          {
             printk( KERN_ERR "trap_kernel: got '9'; trapping the kernel now\n" );
     ff6:       48 c7 c7 00 00 00 00   mov    $0x0,%rdi
     ffd:       31 c0                  xor    %eax,%eax
     fff:       e8 00 00 00 00         callq  1004 <proc_dointvec_trap_kernel+0x54>
             char *trap = NULL;
             *trap = 1;
    1004:       c6 04 25 00 00 00 00   movb   $0x1,0x0
    100b:       01
    100c:       eb 23                  jmp    1031 <proc_dointvec_trap_kernel+0x81>
    100e:       66                     data16
    100f:       90                     nop
          }
          else
          {
             printk( KERN_ERR "trap_kernel: ignoring '%c'\n", c );
    1010:       0f be f2               movsbl %dl,%esi
    1013:       48 c7 c7 00 00 00 00   mov    $0x0,%rdi
    101a:       31 c0                  xor    %eax,%eax

 

Looking at offset 0x1004, we see the assembly instruction movb $0x1,0x0, which stores the value 1 at memory address 0x0. On x86 and AMD64 hardware, this produces a page fault that results in the trap we observed.

Immediately above the assembly instruction is the C source code from which this instruction was generated:

char *trap = NULL;
*trap = 1;

 

This code is a blatant example of the classic programming error of dereferencing a NULL pointer. We’ve found the cause of the trap! Of course, this is the code I added as discussed in the “Adding a Manual Kernel Trap” section.
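
The manual lookup just performed (find the function's start offset, add the offset from the oops, and search for the resulting address) could be scripted roughly as follows. This is a sketch under stated assumptions: find_trap_line is a hypothetical helper name, and it expects a listing in the "objdump -d -S" format shown above on standard input.

```shell
# Print the line at <function>+<decimal offset> from an "objdump -d -S"
# listing read on standard input.
find_trap_line() {
    func=$1
    offset=$2
    listing=$(cat)
    # The function header looks like: 0000000000000fb0 <func>:
    start=$(printf '%s\n' "$listing" |
        awk -v h="<$func>:" '$2 == h && !seen { print $1; seen = 1 }')
    [ -n "$start" ] || { echo "function $func not found" >&2; return 1; }
    # Add the decimal offset to the hex start and format the sum as hex.
    target=$(printf '%x' $(( 0x$start + $offset )))
    # Instruction lines begin with the address followed by a colon.
    printf '%s\n' "$listing" | awk -v a="$target:" '$1 == a'
}
```

For the example in this section it would be invoked as find_trap_line proc_dointvec_trap_kernel 84 < kernel/sysctl.dump, which should print the movb line at offset 1004. The source lines above that instruction still have to be read by hand in the dump file.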

Determining the line of code does not always go this smoothly. Occasionally, the calculated offset is nowhere to be found in the disassembled object output. One of the main reasons for this is that the listing was generated with a different version of the compiler than the one originally used to compile the object in which the oops occurred. It is extremely important to use exactly the same compiler version. Different versions of a compiler, even minor releases, can change the ordering of the assembly instructions. Different optimization levels or options will also change the generated assembly, even when the same compiler version is used. At the assembly level, a single different or relocated instruction makes it very difficult to locate the trapping instruction definitively. With some ingenuity, though, even if the compiler versions differ slightly, one can get close enough to the trapping area of code to discover the fault.

7.2.9.1. 2.4.x Kernel Oops Dumps

Analyzing kernel oops dumps in a Linux 2.4.x kernel is slightly different from doing so in a 2.6.x kernel. The main reason for this is the addition of the kallsyms feature to the 2.6 mainline kernel source. The kallsyms feature provides a listing of all kernel symbols, which the kernel itself can use to translate a hexadecimal address into a human-readable symbol name.

Note that many 2.4-based distributions have backported the kallsyms feature to their customized 2.4-based kernels. This means that if an oops occurs in these distributions, the dumped data is formatted automatically. If your distribution does not have the kallsyms patch, or you are running a 2.4.x kernel as downloaded from kernel.org, you will need to format the oops message manually before it is useful to anyone. To do this, run the ksymoops utility with the appropriate parameters as documented in the ksymoops(8) man page.

7.2.10. Kernel Oopses and Hardware

A kernel oops does not always mean that a software error was found in the kernel. In fact, hardware fails often enough that it should never be ruled out as a possible cause of an oops. The question then becomes how to determine whether the oops is caused by faulty hardware. Here are some clues that can point to faulty hardware:

  • Oopses occurring in places where, after examining the source code around the trap, it seems almost impossible for them to occur.
  • Recurring oopses that don’t always happen in the same place. Software bugs are usually reproducible, but faulty hardware, especially bad RAM, can cause strange things to happen in seemingly random places.
  • Sudden start of oopses: the operating system has been running fine with little or no change to it, and oopses suddenly start occurring.
  • Hard machine lockups where nothing is displayed on the screen, the SysRq magic hotkey does nothing, and only a hard reboot is possible.

If the hardware is suspected, tests should be performed immediately, starting with the RAM unless there is reason to suspect some other piece of hardware. Many servers have built-in diagnostic programs that can be accessed from the BIOS menu. These should be run first. Software testing programs such as memtest86 (available at www.memtest86.org) should also be run to examine the RAM.

Note: memtest86 was historically meant for 32-bit x86-based hardware. The need for this software to support AMD64 hardware was great, and this was one of the reasons for the spin-off creation of memtest86+ (available at www.memtest.org). memtest86+ fully supports all AMD Opteron chips.

 

The fsck (file system check) utility should also be run on all hard drive partitions to which the server has access, ideally while they are unmounted or mounted read-only. Sometimes corruption can occur on a filesystem, which can lead to corruption in libraries or executables. This corruption can scramble the instructions within the code, resulting in very bad things happening. The fsck utility can detect and repair many such corruptions.

7.2.11. Setting up cscope to Index Kernel Sources

Most Linux distributions include the cscope package. If yours does not, you can easily locate it for download on the Internet. cscope is a utility that scans through a defined set of source code files and builds a single index file. You can then use the cscope interface to search for function, variable, and macro definitions, to list which functions call a particular function, to list which files include a particular header file, and more. It’s an extremely useful tool, and we highly recommend setting it up if you plan on looking through any amount of kernel source. It isn’t limited to indexing kernel source; you can set it up to “scope” through any set of source files you wish!

Before running cscope to retrieve symbol information, a symbols database must be built. Assuming your kernel source tree is in /usr/src/linux-2.6.2, the following command will create the symbols database in the file /usr/src/linux-2.6.2/cscope.out:

find /usr/src/linux-2.6.2 \( -name "[A-Za-z]*.[CcHh]" -o -name "*.[ch]pp" \
    -o -name "*.[CH]PP" -o -name "*.skl" -o -name "*.java" \) -print |
    cscope -bku -f /usr/src/linux-2.6.2/cscope.out -i -

 

The cscope parameters to take note of are -b, which builds the symbol database (or cross-reference, as it is called in the man page), and -k for “kernel mode,” which keeps cscope from searching the default /usr/include directory so that only the kernel’s own include files are scanned.

Once the database is built, searching it is simply a matter of running this command:

cscope -d -P /usr/src/linux-2.6.2 -p 20 -f /usr/src/linux-2.6.2/cscope.out

 

The following simple Korn Shell script can be used to handle all of this for you.

#!/bin/ksh
#
# Simple script to handle building and querying a cscope database for
# kernel source with support for multiple databases.

build=0
dbpath="linux"
force=0

function usage {
   echo "Usage: kscope [-build [-force]] [-dbpath <path>]"
}

# Parse the command-line options.
while [ $# -gt 0 ]; do
   if [ "$1" = "-build" ]; then
      build=1
   elif [ "$1" = "-dbpath" ]; then
      shift
      dbpath=$1
   elif [ "$1" = "-force" ]; then
      force=1
   else
      usage
      exit 1
   fi
   shift
done

if [ $build -eq 1 ]; then
   # Refuse to overwrite an existing database unless -force was given.
   if [ -f "/usr/src/$dbpath/cscope.out" -a $force -ne 1 ]; then
      echo "cscope database already exists. Use '-force' to overwrite."
      exit 1
   fi
   echo "Building /usr/src/$dbpath/cscope.out ..."
   find "/usr/src/$dbpath" \( -name "[A-Za-z]*.[CcHh]" -o -name "*.[ch]pp" \
      -o -name "*.[CH]PP" -o -name "*.skl" -o -name "*.java" \) -print |
      cscope -bku -f "/usr/src/$dbpath/cscope.out" -i -
   echo "Done."
else
   # Query mode: the database must already exist.
   if [ ! -f "/usr/src/$dbpath/cscope.out" ]; then
      echo "cscope database (/usr/src/$dbpath/cscope.out) not found."
      exit 1
   fi
   cscope -d -P "/usr/src/$dbpath" -p 20 -f "/usr/src/$dbpath/cscope.out"
fi

7.3. Conclusion

Unfortunately, system crashes and hangs do happen. With the knowledge and tools presented in this chapter, you should at least know where to gather the diagnostic data that others need to analyze the problem and determine its cause. You can use the tips and suggestions presented in Chapter 1, “Best Practices and Initial Investigation,” in conjunction with the information in this chapter to reach a solution to your problem as efficiently and effectively as possible.

For the more advanced or curious reader, this chapter presents enough information to dive deeply into the problem and attempt to solve it without help from others. Depending on your situation, this is not always the best approach, so caution should be exercised. In either case, with the right tools and knowledge, dealing with system problems on Linux can be much easier than on other operating systems.

 
