DTrace vs. truss

This article is mostly a translation of http://www.brendangregg.com/DTrace/dtracevstruss.html
I learned a great deal from it, so I have attempted a translation to share with everyone.

DTrace vs. truss

1 The problem
2 Analyzing a classic performance problem
3 Analysis with truss
4 Analysis with DTrace
5 DTrace vs. truss: impact on the system
6 Why
7 Notes

1 The problem

Everyone who has used DTrace says it is extremely powerful, and also that its impact on the system is tiny and that it is safe to use. Yet there has rarely been clear data to back these claims up, let alone a concrete experiment to demonstrate them. This article tries to answer these questions through a concrete example that compares DTrace with truss.

2 Analyzing a classic performance problem

The output below shows a single-CPU system with a performance problem.

# uptime
11:22am up 31 day(s), 13:53, 0 users, load average: 1.16, 1.13, 0.90
# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap    free   re   mf pi po fr de sr dd f0 s0 --   in   sy   cs us sy id
 0 0 0 4718344 644968    1   15  1  0  0  0  0  0  0  0  0  411  195  160  1  1 98
 0 0 0 4611776 563304  463 5546  0  0  0  0  0  0  0  0  0  955 5776 1002 29 64  7
 0 0 0 4611672 563208  450 5420  0  0  0  0  0  0  0  0  0  947 5671 1015 28 64  8
 0 0 0 4611568 563232  478 5712  0  0  0  0  0  1  0  0  0  959 5825 1018 28 65  7
 0 0 0 4611568 563232  446 5362  0  0  0  0  0  0  0  0  0  939 5634  992 29 63  8
 0 0 0 4611568 563256  409 4922  0  0  0  0  0  2  0  0  0  918 5193  883 24 70  6
 0 0 0 4611568 563240  455 5456  0  0  0  0  0  0  0  0  0  938 5648  987 28 64  8

This is a single-CPU machine, so a load average of 1.16 already indicates a problem. The CPU is only about 8% idle; the other 92% is being consumed somewhere.

Use prstat to look at the load of each process:

PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
1435 ttest 1456K 976K run 0 0 0:02:58 10% ksh/1
1275 ttest 12M 9424K run 59 0 2:48:44 7.2% Xvnc/1
1421 ttest 8376K 4320K run 59 0 0:01:38 1.3% dtterm/1
11654 ttest 9960K 7288K sleep 59 0 0:20:47 0.5% Xvnc/1
28535 root 4912K 4560K cpu0 59 0 0:00:00 0.4% prstat/1
439 root 21M 1096K sleep 59 0 0:51:01 0.1% nessusd/1
10314 root 2104K 1472K sleep 59 0 0:07:52 0.1% rpc.rstatd/1
4998 root 93M 44M sleep 29 10 0:39:25 0.1% java/23
11814 ttest 7880K 4432K sleep 59 0 0:03:41 0.1% sdtperfmeter/1
1466 ttest 7904K 3296K run 59 0 0:26:59 0.1% sdtperfmeter/1
11775 ttest 8288K 4896K sleep 59 0 0:03:07 0.1% dtterm/1

None of these processes accounts for that much CPU; together they add up to about 20% at most. So where did the remaining 92% - 20% = 72% of the CPU go?

One possibility is that the CPU is being consumed by kernel threads, which do not show up against any process. For example, the CPU could be spent in interrupt threads, which intrstat can show. (intrstat is itself implemented with DTrace.)

# intrstat
device | cpu0 %tim
-------------+---------------
hme#0 | 540 1.7

device | cpu0 %tim
-------------+---------------
hme#0 | 527 1.6
uata#0 | 1 0.0

This shows that only about 1.7% of the CPU is going to interrupts (hme is the system's network interface), so where is the remaining ~70%?
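Since intrstat is itself built on DTrace, roughly the same measurement can be sketched by hand using the sdt provider's interrupt probes. This is only a minimal sketch, assuming the interrupt-start/interrupt-complete probes are available on this release; it is not how intrstat is actually implemented:

# dtrace -n '
    sdt:::interrupt-start  { self->ts = vtimestamp; }
    sdt:::interrupt-complete
    /self->ts/
    {
        /* total nanoseconds spent in interrupt handlers */
        @["interrupt time (ns)"] = sum(vtimestamp - self->ts);
        self->ts = 0;
    }'

Interrupted with Ctrl-C, it prints the total time spent in interrupt handlers during the run.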

Another place the CPU could be going is virtual-to-physical address translation (TLB and TSB misses). trapstat can show this.

# trapstat -T 1
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
0 u 8k| 8271 0.3 5221 0.6 | 30363 1.1 4608 0.5 | 2.5
0 u 64k| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0
0 u 512k| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0
0 u 4m| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
0 k 8k| 2748 0.1 0 0.0 | 421897 15.2 7822 1.1 |16.4
0 k 64k| 0 0.0 0 0.0 | 339 0.0 0 0.0 | 0.0
0 k 512k| 0 0.0 0 0.0 | 0 0.0 0 0.0 | 0.0
0 k 4m| 0 0.0 0 0.0 | 789 0.0 0 0.0 | 0.0
==========+===============================+===============================+====
ttl | 11019 0.4 5221 0.6 | 453388 16.3 12430 1.6 |19.0

So roughly 19% of the CPU is going to trap handling, but that still does not explain the missing 70% - 19% = 51%. OK, let's keep checking: memory, I/O, and network utilization. None of them explains where all this CPU is going either.

# iostat -xnmpz 5
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.2 0.0 1.6 0.0 0.0 0.0 0.6 0 0 c0t0d0
0.0 0.2 0.0 1.6 0.0 0.0 0.0 0.6 0 0 c0t0d0s0 (/)
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.2 0.0 0.3 0.0 0.0 0.0 0.2 0 0 c0t0d0
0.0 0.2 0.0 0.3 0.0 0.0 0.0 0.2 0 0 c0t0d0s0 (/)

There is no heavy disk activity.

3 Analysis with truss

To analyse with truss, you either run the target program under it or attach to an existing PID. We already surveyed the system's processes with prstat; process 1435 is using about 10% of the CPU, so let's attach truss to it.

# truss -p 1435
waitid(P_ALL, 0, 0xFFBFE808, WEXITED|WTRAPPED|WSTOPPED) = 0
sigaction(SIGCLD, 0xFFBFE808, 0xFFBFE8A8) = 0
schedctl() = 0xFF38C000
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
stat64("date", 0xFFBFE828) Err#2 ENOENT
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
stat64("/usr/bin/date", 0xFFBFE828) = 0
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
fork1() = 8636
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
Received signal #18, SIGCLD [caught]
siginfo: SIGCLD CLD_EXITED pid=8636 status=0x0000
schedctl() = 0xFF38C000
setcontext(0xFFBFE610)
getcontext(0xFFBFE768)
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
setcontext(0xFFBFE420)
sigaction(SIGCLD, 0xFFBFE808, 0xFFBFE8A8) = 0
waitid(P_ALL, 0, 0xFFBFE808, WEXITED|WTRAPPED|WSTOPPED) = 0
sigaction(SIGCLD, 0xFFBFE808, 0xFFBFE8A8) = 0
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
stat64("date", 0xFFBFE828) Err#2 ENOENT
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
stat64("/usr/bin/date", 0xFFBFE828) = 0
lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
fork1(

The output scrolls by very quickly. We can see fork1() calls and SIGCLD signals, so let's add an option so that truss also follows child processes.

# truss -pf 1435
1435: lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
1435: stat64("date", 0xFFBFE828) Err#2 ENOENT
1435: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
1435: lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
1435: stat64("/usr/bin/date", 0xFFBFE828) = 0
1435: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
1435: fork1() = 23050
23050: fork1() (returning as child ...) = 1435
23050: getpid() = 23050 [1435]
23050: lwp_self() = 1
23050: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
23050: getpid() = 23050 [1435]
23050: sigaction(SIGXFSZ, 0xFFBFE7F0, 0xFFBFE890) = 0
23050: sigaction(SIGTERM, 0xFFBFE7F0, 0xFFBFE890) = 0
1435: lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
1435: sigaction(SIGCLD, 0xFFBFE808, 0xFFBFE8A8) = 0
23050: execve("date", 0x000582F4, 0x000584F8) Err#2 ENOENT
23050: execve("/usr/bin/date", 0x000582F4, 0x000584F8) argc = 1
23050: resolvepath("/usr/bin/date", "/usr/bin/date", 1023) = 13
23050: resolvepath("/usr/lib/ld.so.1", "/lib/ld.so.1", 1023) = 12

Process 1435 keeps launching the date command; next we can narrow truss down to just the exec calls and focus on how date is being run.

# truss -ftexec -p 1435
5890: execve("date", 0x000582F4, 0x000584F8) Err#2 ENOENT
5890: execve("/usr/bin/date", 0x000582F4, 0x000584F8) argc = 1
5892: execve("date", 0x000582F4, 0x000584F8) Err#2 ENOENT
5892: execve("/usr/bin/date", 0x000582F4, 0x000584F8) argc = 1
5894: execve("date", 0x000582F4, 0x000584F8) Err#2 ENOENT
5894: execve("/usr/bin/date", 0x000582F4, 0x000584F8) argc = 1
5896: execve("date", 0x000582F4, 0x000584F8) Err#2 ENOENT
5896: execve("/usr/bin/date", 0x000582F4, 0x000584F8) argc = 1

Again the output flies by; it looks like date is being invoked over and over, consuming an excessive amount of CPU. And since date is such a quick command, each instance has already exited before prstat can even catch it.

In fact, this performance problem was created artificially by me, by running:

# while :; do date; done;

4 Analysis with DTrace

Let's write a simple script. All it does is print the name of each program exec'd on the system, by firing on the return of the exec system calls.

#!/usr/sbin/dtrace -s

syscall::exec:return, syscall::exece:return
{
        trace(execname);
}


Run it:

# ./ckproc.d
dtrace: script './ckproc.d' matched 2 probes
CPU ID FUNCTION:NAME
0 112 exece:return ksh
0 112 exece:return date
0 112 exece:return ksh
0 112 exece:return date
0 112 exece:return ksh
0 112 exece:return date
0 112 exece:return ksh
0 112 exece:return date
0 112 exece:return ksh
0 112 exece:return date
0 112 exece:return ksh
0 112 exece:return date

We can see date being exec'd again and again. But date is a very quick command, so this alone may not explain the performance problem; we still need to confirm how much CPU these date invocations actually consume. Let's give truss a try first.

# truss -cf -p 1435
^C
signals ------------
SIGCLD 57
total: 57


syscall seconds calls errors
sigaction .011 457
getcontext .001 57
setcontext .005 114
waitid .014 229
fork1 .122 229
lwp_sigmask .023 1201
schedctl .004 229
stat64 .056 458 229
-------- ------ ----
sys totals: .241 2974 229
usr time: .071
elapsed: 13.050

Over 13.050 seconds of elapsed time, process 1435 accumulated only 0.241 seconds of system time and 0.071 seconds of user time, so this is clearly not the right answer: the bulk of the CPU is presumably being burned inside the short-lived date children, which these totals do not reflect.

Let's hand the job back to DTrace.

# cat shortlived.d
#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
	/* save start time */
	start = timestamp;

	/* procs accumulates the CPU time of short-lived processes */
	procs = 0;

	/* print header */
	printf("Tracing... Hit Ctrl-C to stop.\n");
}

/*
 * Measure parent fork time (the time the parent spends in the fork syscall)
 */
syscall::*fork*:entry
{
	/* save start of fork */
	self->fork = vtimestamp;
}

/* arg0 != 0 means this is the parent returning from fork */
syscall::*fork*:return
/arg0 != 0 && self->fork/
{
	/* record elapsed time for the fork syscall */
	this->elapsed = vtimestamp - self->fork;
	procs += this->elapsed;
	self->fork = 0;
}

/*
 * Measure child process time; arg0 == 0 means this is the child.
 */
syscall::*fork*:return
/arg0 == 0/
{
	/* save start of child process */
	self->start = vtimestamp;

	/* memory cleanup */
	self->fork = 0;
}

proc:::exit
/self->start/
{
	/* record elapsed time for process execution */
	this->elapsed = vtimestamp - self->start;
	procs += this->elapsed;

	/* sum elapsed by process name and ppid */
	@Times_exec[execname] = sum(this->elapsed/1000000);
	@Times_ppid[ppid] = sum(this->elapsed/1000000);

	/* memory cleanup */
	self->start = 0;
}

/*
 * Print report
 */
dtrace:::END
{
	this->total = timestamp - start;
	printf("short lived processes: %6d.%03d secs\n",
	    procs/1000000000, (procs%1000000000)/1000000);
	printf("total sample duration: %6d.%03d secs\n",
	    this->total/1000000000, (this->total%1000000000)/1000000);
	printf("\nTotal time by process name,\n");
	printa("%18s %@12d ms\n", @Times_exec);
	printf("\nTotal time by PPID,\n");
	printa("%18d %@12d ms\n", @Times_ppid);
}
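One note on the measurement: the script sums vtimestamp, which only advances while a thread is actually running on a CPU, so the totals it reports are CPU time consumed by the short-lived processes rather than how long they merely existed; timestamp is used only for the overall sample duration.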

Run it:

# ./shortlived.d
Tracing... Hit Ctrl-C to stop.
^C
short lived processes: 10.675 secs
total sample duration: 16.080 secs

Total time by process name,
date 9794 ms

Total time by PPID,
1435 9794 ms

The short-lived processes consumed roughly 10.675 / 16.080 ≈ 66% of the CPU over the sample, and within those short-lived processes it is all date (under PPID 1435). Now we can be certain that the performance problem is caused by date.


5 DTrace vs. truss: impact on the system


The system is still running the same infinite loop of date commands:

# while :;do date; done;

Use sar to get a baseline of how the system is behaving right now:

# sar -c 1 10

SunOS ans80064 5.10 Generic sun4u 11/18/2008

16:54:41 scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s
16:54:42 6156 235 214 76.47 76.47 33945 30282
16:54:43 6160 238 215 78.22 77.23 34238 31633
16:54:44 6174 232 211 76.47 77.45 33562 29686
16:54:45 6113 247 222 75.73 75.73 34245 32588
16:54:46 5412 212 191 66.67 65.69 30281 27114
16:54:47 5414 217 196 65.35 66.34 30350 28514
16:54:48 6146 238 217 76.47 76.47 33922 31542
16:54:49 6024 230 209 76.47 75.49 33009 30088
16:54:50 6195 256 213 75.49 75.49 33754 30546
16:54:52 6536 286 217 73.79 74.76 33485 32517

Average 6034 239 210 74.12 74.12 33082 30456

The system manages roughly 74 new processes per second (fork/s and exec/s).

Now run truss (attached as before) and check sar again:

# sar -c 1 10

SunOS ans80064 5.10 Generic sun4u 11/18/2008

16:54:17 scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s
16:54:18 6708 871 832 41.18 20.59 423676 73527
16:54:19 6858 887 839 43.14 20.59 422283 75096
16:54:20 6858 893 844 41.18 20.59 422217 73583
16:54:21 6783 892 842 39.22 20.59 420703 84269
16:54:22 6806 875 835 41.18 20.59 418634 75995
16:54:23 6796 881 850 41.58 20.79 430140 76747
16:54:24 6741 862 830 42.72 20.39 418596 72583
16:54:25 6824 880 850 41.18 21.57 439096 73420
16:54:26 6759 866 832 41.18 20.59 422171 71759
16:54:27 6824 882 850 43.56 20.79 427311 75493

Average 6795 879 840 41.61 20.71 424469 75243

Wow, that is a dramatic drop. With truss attached, the system can only exec about 20 new processes per second, a drop of (74 - 20) / 74 ≈ 73%.

What about DTrace?

# sar -c 1 10

SunOS ans80064 5.10 Generic sun4u 11/18/2008

17:15:12 scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s
17:15:13 6875 350 315 69.90 69.90 166544 141523
17:15:14 6058 231 205 75.49 75.49 32213 28519
17:15:15 6105 227 205 75.49 74.51 32079 29568
17:15:16 6078 230 209 75.49 76.47 32115 28831
17:15:17 5460 204 184 66.34 66.34 29243 27083
17:15:18 6131 266 217 72.82 72.82 36572 33469
17:15:19 6021 232 209 74.29 74.29 31427 38946
17:15:20 6046 229 207 75.49 74.51 32119 25904
17:15:21 6162 231 209 76.24 77.23 31858 29743
17:15:22 5994 224 202 74.76 74.76 32024 26863

Average 6094 243 216 73.63 73.63 45703 41140

The rate only drops from 74.12 to 73.63 execs per second; DTrace has essentially no impact on the system.

6 Why

Both are system analysis tools, so why is there such a large difference between DTrace and truss? mpstat can give us some clues.

While truss is running:

# mpstat 1 8
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 65 0 0 416 315 168 1 0 0 0 244 1 2 0 97
0 4873 0 0 611 510 1368 199 0 2 0 6377 38 61 0 1
0 4659 0 0 607 508 1327 181 0 3 0 6236 38 60 0 2
0 4735 0 0 616 516 1383 213 0 1 0 6278 36 62 0 2
0 4780 0 0 626 525 1410 255 0 1 0 6530 37 62 0 1
0 4862 0 0 612 512 1398 212 0 1 0 6474 37 62 0 1
0 4723 0 0 620 519 1383 227 0 1 0 6447 37 61 0 2
0 4771 0 0 602 503 1397 231 0 1 0 6383 36 62 0 2


While DTrace is running:

# mpstat 1 8
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 64 0 0 416 315 168 1 0 0 0 243 1 2 0 97
0 3367 0 0 907 743 739 107 0 0 0 4004 38 58 0 4
0 5287 0 0 1146 947 999 115 0 0 0 5675 29 63 0 8
0 5274 0 0 1176 977 987 130 0 2 0 5817 29 63 0 8
0 5595 0 0 1153 950 989 103 0 1 0 5730 27 65 0 8
0 5586 0 0 1149 948 994 85 0 0 0 5726 28 64 0 8
0 5514 0 0 1143 942 987 86 0 1 0 5652 27 64 0 9
0 5551 0 0 1149 948 999 96 0 0 0 5728 28 65 0 7

From these numbers, the context-switch count (csw) under DTrace is much lower than under truss, which is probably where the difference lies. Let's use DTrace itself to see what is causing all this context switching — in other words, who keeps getting scheduled onto the CPU.

While truss is running:

# dtrace -n 'sched:::on-cpu { @[execname] = count(); } profile:::tick-20s { exit(0); }'
dtrace: description 'sched:::on-cpu ' matched 2 probes
CPU ID FUNCTION:NAME
0 49497 :tick-20s

powerd 1
snmpd 2
automountd 2
init 2
nmbd 5
inetd 5
fmd 5
svc.configd 5
sendmail 12
xntpd 20
fsflush 21
nscd 30
nessusd 30
dtwm 38
Xsun 39
lockmgr 46
dtrace 46
httpd 47
svc.startd 51
rpc.rstatd 59
dtgreet 103
sdtperfmeter 236
java 608
dtterm 1397
utmpd 1697
sched 2072
Xvnc 2100
date 3251
ksh 4318
truss 10647

During this roughly 20-second sample, truss itself was switched onto the CPU 10,647 times. truss operates by sending control messages to procfs's ctl and lwpctl files, causing the traced process to freeze for every system call so that truss can print a line of output. Hence truss's behaviour is synchronous with the system calls.

While DTrace is running:

# dtrace -n 'sched:::on-cpu { @[execname] = count(); } profile:::tick-20s { exit(0); }'
dtrace: description 'sched:::on-cpu ' matched 2 probes
CPU ID FUNCTION:NAME
0 49497 :tick-20s

powerd 1
dtwm 1
snmpd 4
svc.configd 5
inetd 5
fmd 5
nmbd 5
nfsmapid 7
syslogd 12
sendmail 12
xntpd 20
fsflush 21
nscd 24
nessusd 30
rpc.rstatd 40
Xsun 40
lockmgr 46
httpd 46
svc.startd 51
dtgreet 107
dtrace 112
sdtperfmeter 224
java 596
dtterm 1034
Xvnc 1934
date 2366
sched 5377
ksh 6469

DTrace barely steps onto the CPU. This is because DTrace uses a per CPU buffer in the kernel that is read at comfortable intervals by the user level DTrace consumer, /usr/sbin/dtrace. Often this interval is once per second, which would mean the DTrace program is stepping onto the CPU once per second rather than for every system call. This behaviour is asynchronous to the system calls, and results in minimal performance impact.
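To make that buffering model concrete (this is only an illustrative sketch, not something from the original article), the consumer's read interval and the per-CPU buffer size are tunable D options:

#!/usr/sbin/dtrace -s
/*
 * Count system calls by name. The pragmas make the buffering explicit:
 * the consumer switches and reads the per-CPU principal buffers once
 * per second (the default rate), and each buffer is 4 MB, so the
 * dtrace process wakes up about once a second no matter how fast the
 * system calls occur.
 */
#pragma D option switchrate=1hz
#pragma D option bufsize=4m

syscall:::entry
{
	@calls[probefunc] = count();
}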

7 Notes

From the analysis above we can conclude that truss cannot really help us analyse short-lived processes. Besides DTrace, a few other tools can also help:

  • prstat -m, this shows the many system calls made by the shell process (ksh in this example).
  • lastcomm, if process accounting is switched on this will list recent processes.
  • praudit, if BSM auditing is switched on this will give extensive details on short lived processes.
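
And with DTrace itself, a well-known one-liner prints the full argument string of every process as it is exec'd (execsnoop from the DTraceToolkit is a more polished version of the same idea):

# dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'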

The end.
