A Linux OOM (out-of-memory) troubleshooting walkthrough

 

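I. Background

We received an alert for an application service. After logging in to the server to investigate, we found that the process was no longer running.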

 

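II. Problem analysis

1. Possible reasons the process disappeared:

(1) The machine rebooted

uptime showed that the machine had not rebooted.

(2) The program hit a bug and exited on its own

The program's error log showed nothing abnormal.

(3) Something else killed it

The program uses a lot of memory, so we suspected that the system had killed it because of an OOM condition. The messages log confirmed it: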

Jul 27 13:29:54 kernel: Out of memory: Kill process 17982 (java) score 77 or sacrifice child
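As a minimal sketch of how that search could be automated, the Python snippet below picks OOM kill messages out of a list of log lines. The regular expression is an assumption tied to the message wording shown above (a 3.10-era kernel); other kernel versions word this line differently. The second sample line is made up for illustration.

```python
import re

# Pattern for the OOM kill message as it appears in this post's log.
# Assumption: other kernel versions may use different wording.
OOM_KILL_RE = re.compile(
    r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\) score (?P<score>\d+)"
)

def find_oom_kills(lines):
    """Return a (pid, name, score) tuple for every OOM kill message in lines."""
    kills = []
    for line in lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append((int(match["pid"]), match["name"], int(match["score"])))
    return kills

sample = [
    "Jul 27 13:29:54 kernel: Out of memory: Kill process 17982 (java) score 77 or sacrifice child",
    "Jul 27 13:29:55 kernel: some unrelated message",  # hypothetical line
]
print(find_oom_kills(sample))
```

In practice the lines would come from the messages log or from dmesg output rather than from a hard-coded list.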

 

2. Using the detailed OOM output to find out why the process was killed

[511250.458988] mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[511250.458993] mysqld cpuset=/ mems_allowed=0
[511250.458996] CPU: 7 PID: 30063 Comm: mysqld Not tainted 3.10.0-514.21.2.el7.x86_64 #1
[511250.458997] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[511250.458999]  ffff88056236bec0 0000000040f4df68 ffff88044b76b910 ffffffff81687073
[511250.459002]  ffff88044b76b9a0 ffffffff8168201e ffffffff810eb0dc ffff88081ae80c20
[511250.459004]  ffff88081ae80c38 ffff88044b76b9f8 ffff88056236bec0 0000000000000000
[511250.459007] Call Trace:
[511250.459015]  [<ffffffff81687073>] dump_stack+0x19/0x1b
[511250.459020]  [<ffffffff8168201e>] dump_header+0x8e/0x225
[511250.459026]  [<ffffffff810eb0dc>] ? ktime_get_ts64+0x4c/0xf0
[511250.459033]  [<ffffffff81184cfe>] oom_kill_process+0x24e/0x3c0
[511250.459035]  [<ffffffff8118479d>] ? oom_unkillable_task+0xcd/0x120
[511250.459038]  [<ffffffff81184846>] ? find_lock_task_mm+0x56/0xc0
[511250.459042]  [<ffffffff81093c0e>] ? has_capability_noaudit+0x1e/0x30
[511250.459045]  [<ffffffff81185536>] out_of_memory+0x4b6/0x4f0
[511250.459047]  [<ffffffff81682b27>] __alloc_pages_slowpath+0x5d7/0x725
[511250.459051]  [<ffffffff8118b645>] __alloc_pages_nodemask+0x405/0x420
[511250.459055]  [<ffffffff811cf94a>] alloc_pages_current+0xaa/0x170
[511250.459058]  [<ffffffff81180bd7>] __page_cache_alloc+0x97/0xb0
[511250.459060]  [<ffffffff81183750>] filemap_fault+0x170/0x410
[511250.459078]  [<ffffffffa01b5016>] ext4_filemap_fault+0x36/0x50 [ext4]
[511250.459082]  [<ffffffff811ac84c>] __do_fault+0x4c/0xc0
[511250.459084]  [<ffffffff811acce3>] do_read_fault.isra.42+0x43/0x130
[511250.459087]  [<ffffffff811b1471>] handle_mm_fault+0x6b1/0x1040
[511250.459091]  [<ffffffff810f55c0>] ? futex_wake+0x80/0x160
[511250.459096]  [<ffffffff81692c04>] __do_page_fault+0x154/0x450
[511250.459098]  [<ffffffff81692fe6>] trace_do_page_fault+0x56/0x150
[511250.459101]  [<ffffffff8169268b>] do_async_page_fault+0x1b/0xd0
[511250.459103]  [<ffffffff8168f178>] async_page_fault+0x28/0x30
[511250.459104] Mem-Info:
[511250.459109] active_anon:7922627 inactive_anon:1653 isolated_anon:0
 active_file:1675 inactive_file:2820 isolated_file:0
 unevictable:0 dirty:11 writeback:2 unstable:0
 slab_reclaimable:61817 slab_unreclaimable:25990
 mapped:3607 shmem:4602 pagetables:42625 bounce:0
 free:50021 free_pcp:149 free_cma:0
[511250.459112] Node 0 DMA free:15892kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[511250.459117] lowmem_reserve[]: 0 2814 31994 31994
[511250.459120] Node 0 DMA32 free:119704kB min:5940kB low:7424kB high:8908kB active_anon:2678512kB inactive_anon:276kB active_file:124kB inactive_file:132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129216kB managed:2883436kB mlocked:0kB dirty:0kB writeback:0kB mapped:1100kB shmem:1632kB slab_reclaimable:48796kB slab_unreclaimable:9340kB kernel_stack:5248kB pagetables:11424kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:32902 all_unreclaimable? yes
[511250.459124] lowmem_reserve[]: 0 0 29180 29180
[511250.459127] Node 0 Normal free:63896kB min:61608kB low:77008kB high:92412kB active_anon:29011996kB inactive_anon:6336kB active_file:6576kB inactive_file:11148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:30408704kB managed:29881068kB mlocked:0kB dirty:44kB writeback:8kB mapped:13328kB shmem:16776kB slab_reclaimable:198472kB slab_unreclaimable:94604kB kernel_stack:53472kB pagetables:159076kB unstable:0kB bounce:0kB free_pcp:656kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:924 all_unreclaimable? no
[511250.459131] lowmem_reserve[]: 0 0 0 0
[511250.459134] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
[511250.459144] Node 0 DMA32: 9372*4kB (UEM) 2427*8kB (UEM) 1179*16kB (UEM) 369*32kB (UEM) 104*64kB (EM) 31*128kB (EM) 14*256kB (UEM) 9*512kB (UEM) 7*1024kB (UEM) 3*2048kB (M) 0*4096kB = 119704kB
[511250.459154] Node 0 Normal: 1540*4kB (UE) 6148*8kB (UE) 503*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 63392kB
[511250.459162] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[511250.459163] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[511250.459164] 9275 total pagecache pages
[511250.459166] 0 pages in swap cache
[511250.459167] Swap cache stats: add 0, delete 0, find 0/0
[511250.459168] Free swap  = 0kB
[511250.459168] Total swap = 0kB
[511250.459169] 8388478 pages RAM
[511250.459170] 0 pages HighMem/MovableOnly
[511250.459171] 193375 pages reserved
[511250.459172] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[511250.459178] [  444]     0   444    30482      118      63        0             0 systemd-journal
[511250.459180] [  476]     0   476    14365      114      28        0         -1000 auditd
[511250.459182] [  508]     0   508     5315       75      14        0             0 irqbalance
[511250.459184] [  509]   998   509   132421     1908      50        0             0 polkitd
[511250.459186] [  510]     0   510     6686      196      17        0             0 systemd-logind
[511250.459188] [  514]    81   514     6672      148      16        0          -900 dbus-daemon
[511250.459189] [  592]     0   592     6972       52      18        0             0 atd
[511250.459191] [  595]     0   595    31969      188      17        0             0 crond
[511250.459193] [  607]     0   607    28020       44      11        0             0 agetty
[511250.459195] [ 1036]     0  1036   138798     3179      89        0             0 tuned
[511250.459197] [ 1037]     0  1037   174118      357     182        0             0 rsyslogd
[511250.459198] [ 1089]    38  1089     7865      174      19        0             0 ntpd
[511250.459200] [ 4714]     0  4714    26866      243      54        0         -1000 sshd
[511250.459202] [ 6624]     0  6624      920      100       4        0             0 aliyun-service
[511250.459204] [19284]     0 19284     8386      171      21        0             0 AliYunDunUpdate
[511250.459206] [19335]     0 19335    34887     1367      64        0             0 AliYunDun
[511250.459208] [21657]    26 21657    59097     1539      52        0         -1000 postgres
[511250.459210] [21658]    26 21658    48503      264      43        0             0 postgres
[511250.459212] [21660]    26 21660    59124     2338      52        0             0 postgres
[511250.459213] [21661]    26 21661    59097      332      48        0             0 postgres
[511250.459215] [21662]    26 21662    59097      537      47        0             0 postgres
[511250.459217] [21663]    26 21663    59328      513      50        0             0 postgres
[511250.459218] [21664]    26 21664    49067      317      44        0             0 postgres
[511250.459220] [ 7276]     0  7276    32471      164      16        0             0 screen
[511250.459222] [ 7277]     0  7277    29357      123      13        0             0 bash
[511250.459223] [ 7388]     0  7388     4303     1880      12        0             0 sagent
[511250.459225] [ 7747]     0  7747    32504      200      16        0             0 screen
[511250.459226] [ 7748]     0  7748    29357      122      14        0             0 bash
[511250.459228] [ 7781]     0  7781     8051     4108      20        0             0 tagent
[511250.459230] [ 9897]     0  9897  3062553   270245     774        0             0 java
[511250.459231] [ 9937]    26  9937    59406      657      53        0             0 postgres
[511250.459233] [ 9940]    26  9940    60212     2570      57        0             0 postgres
[511250.459235] [ 9997]    26  9997    60098     2346      56        0             0 postgres
[511250.459236] [10076]    26 10076    59574      964      54        0             0 postgres
[511250.459238] [10077]    26 10077    59618     1006      54        0             0 postgres
[511250.459239] [10078]    26 10078    59617     1005      54        0             0 postgres
[511250.459241] [11611]     0 11611    60826     4190      73        0             0 python
[511250.459243] [11619]     0 11619   348938     6222     118        0             0 python
[511250.459245] [12396]    26 12396    60086     2078      56        0             0 postgres
[511250.459246] [12499]  1001 12499  1448783    99046     328        0             0 java
[511250.459248] [12600]  1003 12600  2226317   312995     847        0             0 java
[511250.459249] [29241]     0 29241    78180     1320     101        0             0 php-fpm
[511250.459251] [29242]  1004 29242   135239     2687     108        0             0 php-fpm
[511250.459253] [29243]  1004 29243   134924     2408     108        0             0 php-fpm
[511250.459255] [29244]  1004 29244   135371     2707     108        0             0 php-fpm
[511250.459256] [29245]  1004 29245   143755    11294     125        0             0 php-fpm
[511250.459258] [29246]  1004 29246   135367     2706     108        0             0 php-fpm
[511250.459260] [29826]    27 29826    28792       86      13        0             0 mysqld_safe
[511250.459261] [30051]    27 30051   322930    39761     133        0             0 mysqld
[511250.459263] [30234]     0 30234    11365      125      22        0         -1000 systemd-udevd
[511250.459264] [11182]     0 11182    82780     5702     114        0             0 salt-minion
[511250.459266] [11193]     0 11193   171406     8289     144        0             0 salt-minion
[511250.459268] [11195]     0 11195   101432     5712     110        0             0 salt-minion
[511250.459269] [29678]  1004 29678   140301     7833     118        0             0 php-fpm
[511250.459271] [29998]  1004 29998   134983     2404     108        0             0 php-fpm
[511250.459273] [11833]     0 11833    69721     2098      58        0             0 python2.7
[511250.459275] [32113]    26 32113    60131     2012      56        0             0 postgres
[511250.459276] [ 1017]  1004  1017   135410     2748     108        0             0 php-fpm
[511250.459278] [11915]  1004 11915   144263    11778     126        0             0 php-fpm
[511250.459280] [ 5999]     0  5999     8115     3139      20        0             0 tagent
[511250.459281] [21572]  1004 21572   134919     2379     108        0             0 php-fpm
[511250.459283] [21752]  1004 21752   143751    11276     125        0             0 php-fpm
[511250.459285] [ 2977]  1004  2977   134920     2406     107        0             0 php-fpm
[511250.459286] [ 9217]     0  9217   330989   183882     550        0             0 python2.7
[511250.459288] [ 2008]  1004  2008   135816     3328     109        0             0 php-fpm
[511250.459290] [25089]  1000 25089  2800777   187701     710        0             0 java
[511250.459291] [25405]  1000 25405  1335611   105668     366        0             0 java
[511250.459293] [26033]  1000 26033  1680746    96082     367        0             0 java
[511250.459295] [26112]  1000 26112  1148121    61227     230        0             0 java
[511250.459296] [14446]     0 14446    31082      540      56        0             0 nginx
[511250.459298] [14447]  1004 14447    31278      739      58        0             0 nginx
[511250.459299] [14448]  1004 14448    31278      725      58        0             0 nginx
[511250.459301] [14449]  1004 14449    31278      714      58        0             0 nginx
[511250.459303] [14450]  1004 14450    31278      715      58        0             0 nginx
[511250.459304] [14451]  1004 14451    31245      705      58        0             0 nginx
[511250.459306] [14452]  1004 14452    31245      696      58        0             0 nginx
[511250.459307] [14453]  1004 14453    31278      712      58        0             0 nginx
[511250.459309] [14454]  1004 14454    31245      728      58        0             0 nginx
[511250.459310] [14455]  1004 14455    31278      730      58        0             0 nginx
[511250.459312] [14456]  1004 14456    31278      718      58        0             0 nginx
[511250.459314] [14457]  1004 14457    31245      707      58        0             0 nginx
[511250.459315] [14458]  1004 14458    31278      722      58        0             0 nginx
[511250.459317] [14459]  1004 14459    31278      717      58        0             0 nginx
[511250.459318] [14460]  1004 14460    31245      688      58        0             0 nginx
[511250.459320] [14462]  1004 14462    31278      712      58        0             0 nginx
[511250.459321] [14463]  1004 14463    31278      736      58        0             0 nginx
[511250.459323] [14571]     0 14571  3222105   119555     906        0             0 python
[511250.459325] [13969]     0 13969   134928     8719     143        0             0 salt-master
[511250.459326] [13982]     0 13982    78554     5647     100        0             0 salt-master
[511250.459328] [13985]     0 13985   116150     8034     134        0             0 salt-master
[511250.459330] [13989]     0 13989   151040    38826     238        0             0 salt-master
[511250.459331] [13990]     0 13990   103527    12904     148        0             0 salt-master
[511250.459333] [14067]     0 14067   280592     9651     151        0             0 salt-master
[511250.459334] [14072]     0 14072   135099     9889     141        0             0 salt-master
[511250.459336] [14220]     0 14220   134928     8828     135        0             0 salt-master
[511250.459338] [14221]     0 14221  1941362     9675     332        0             0 salt-master
[511250.459339] [14228]     0 14228   175360     9657     148        0             0 salt-master
[511250.459341] [14268]     0 14268   175362     9655     148        0             0 salt-master
[511250.459343] [14314]     0 14314   175361     9662     148        0             0 salt-master
[511250.459344] [14327]     0 14327   175363     9663     148        0             0 salt-master
[511250.459346] [14329]     0 14329   175363     9666     148        0             0 salt-master
[511250.459347] [14330]     0 14330   175364     9666     148        0             0 salt-master
[511250.459349] [14331]     0 14331   175365     9666     148        0             0 salt-master
[511250.459350] [14334]     0 14334   175366     9670     148        0             0 salt-master
[511250.459352] [14338]     0 14338   175366     9669     148        0             0 salt-master
[511250.459354] [14340]     0 14340   175366     9674     148        0             0 salt-master
[511250.459355] [14345]     0 14345   175367     9679     148        0             0 salt-master
[511250.459357] [14349]     0 14349   175367     9675     148        0             0 salt-master
[511250.459358] [14350]     0 14350   175367     9671     148        0             0 salt-master
[511250.459360] [14354]     0 14354   175368     9672     148        0             0 salt-master
[511250.459362] [14357]     0 14357   175369     9678     148        0             0 salt-master
[511250.459363] [14358]     0 14358   175369     9673     148        0             0 salt-master
[511250.459365] [14362]     0 14362   175369     9677     148        0             0 salt-master
[511250.459366] [14364]     0 14364   175370     9680     148        0             0 salt-master
[511250.459368] [14365]     0 14365   175371     9681     148        0             0 salt-master
[511250.459369] [14368]     0 14368   175371     9676     148        0             0 salt-master
[511250.459371] [14370]     0 14370   175371     9674     148        0             0 salt-master
[511250.459372] [14372]     0 14372   175372     9682     148        0             0 salt-master
[511250.459374] [14376]     0 14376   175373     9682     148        0             0 salt-master
[511250.459375] [14377]     0 14377   175374     9676     148        0             0 salt-master
[511250.459377] [14378]     0 14378   175374     9689     148        0             0 salt-master
[511250.459379] [14380]     0 14380   175650     9716     149        0             0 salt-master
[511250.459381] [14384]     0 14384   175375     9690     148        0             0 salt-master
[511250.459382] [14385]     0 14385   175375     9685     148        0             0 salt-master
[511250.459384] [14401]     0 14401   175376     9687     148        0             0 salt-master
[511250.459385] [14404]     0 14404   175377     9685     148        0             0 salt-master
[511250.459387] [14413]     0 14413   175377     9685     148        0             0 salt-master
[511250.459388] [14420]     0 14420   175377     9687     148        0             0 salt-master
[511250.459390] [14421]     0 14421   175378     9686     148        0             0 salt-master
[511250.459392] [14424]     0 14424   175380     9693     148        0             0 salt-master
[511250.459393] [14428]     0 14428   175380     9689     148        0             0 salt-master
[511250.459395] [14435]     0 14435   175382     9698     148        0             0 salt-master
[511250.459396] [14437]     0 14437   175382     9694     148        0             0 salt-master
[511250.459398] [14439]     0 14439   175383     9692     148        0             0 salt-master
[511250.459399] [14442]     0 14442   175384     9694     148        0             0 salt-master
[511250.459401] [14445]     0 14445   175385     9692     148        0             0 salt-master
[511250.459403] [14465]     0 14465   175385     9695     148        0             0 salt-master
[511250.459404] [14473]     0 14473   175385     9695     148        0             0 salt-master
[511250.459406] [14486]     0 14486   175386     9697     148        0             0 salt-master
[511250.459407] [14489]     0 14489   175386     9699     148        0             0 salt-master
[511250.459409] [14503]     0 14503   175386     9699     148        0             0 salt-master
[511250.459410] [14513]     0 14513   175387     9700     148        0             0 salt-master
[511250.459412] [14520]     0 14520   175388     9704     148        0             0 salt-master
[511250.459414] [14523]     0 14523   175389     9700     148        0             0 salt-master
[511250.459415] [14525]     0 14525   175389     9703     148        0             0 salt-master
[511250.459417] [14527]     0 14527   175390     9710     148        0             0 salt-master
[511250.459419] [14533]     0 14533   175390     9705     148        0             0 salt-master
[511250.459420] [14539]     0 14539   175390     9709     148        0             0 salt-master
[511250.459422] [14590]     0 14590   175391     9713     148        0             0 salt-master
[511250.459423] [14598]     0 14598   175390     9705     148        0             0 salt-master
[511250.459425] [14613]     0 14613   175391     9705     148        0             0 salt-master
[511250.459426] [14624]     0 14624   175392     9713     148        0             0 salt-master
[511250.459428] [14630]     0 14630   175392     9707     148        0             0 salt-master
[511250.459429] [14634]     0 14634   175393     9707     148        0             0 salt-master
[511250.459431] [14652]     0 14652   175393     9709     148        0             0 salt-master
[511250.459433] [14677]     0 14677   175394     9708     148        0             0 salt-master
[511250.459434] [14679]     0 14679   175394     9711     148        0             0 salt-master
[511250.459436] [14709]     0 14709   175395     9713     148        0             0 salt-master
[511250.459438] [14718]     0 14718   175396     9710     148        0             0 salt-master
[511250.459439] [14723]     0 14723   175396     9710     148        0             0 salt-master
[511250.459441] [14746]     0 14746   175396     9716     148        0             0 salt-master
[511250.459443] [14752]     0 14752   175461     9717     148        0             0 salt-master
[511250.459444] [14791]     0 14791   175398     9715     148        0             0 salt-master
[511250.459446] [14799]     0 14799   175397     9720     148        0             0 salt-master
[511250.459447] [14804]     0 14804   175472     9721     148        0             0 salt-master
[511250.459449] [14835]     0 14835   175462     9729     148        0             0 salt-master
[511250.459450] [14840]     0 14840   175463     9735     148        0             0 salt-master
[511250.459452] [14864]     0 14864   175463     9727     148        0             0 salt-master
[511250.459453] [14882]     0 14882   175464     9731     148        0             0 salt-master
[511250.459455] [14893]     0 14893   175465     9731     148        0             0 salt-master
[511250.459456] [14899]     0 14899   175465     9720     148        0             0 salt-master
[511250.459458] [14906]     0 14906   175466     9721     148        0             0 salt-master
[511250.459460] [14910]     0 14910   175402     9723     148        0             0 salt-master
[511250.459461] [14984]     0 14984   175466     9725     148        0             0 salt-master
[511250.459463] [14988]     0 14988   175467     9735     148        0             0 salt-master
[511250.459464] [14992]     0 14992   175468     9734     148        0             0 salt-master
[511250.459466] [15072]     0 15072   175468     9735     148        0             0 salt-master
[511250.459467] [15101]     0 15101   175468     9731     148        0             0 salt-master
[511250.459469] [15129]     0 15129   175469     9733     148        0             0 salt-master
[511250.459470] [15143]     0 15143   175469     9737     148        0             0 salt-master
[511250.459472] [15168]     0 15168   175470     9740     148        0             0 salt-master
[511250.459474] [15181]     0 15181   175474     9744     148        0             0 salt-master
[511250.459475] [15219]     0 15219   175474     9734     148        0             0 salt-master
[511250.459477] [15223]     0 15223   175477     9753     148        0             0 salt-master
[511250.459479] [15259]     0 15259   175475     9734     148        0             0 salt-master
[511250.459481] [15266]     0 15266   175476     9735     148        0             0 salt-master
[511250.459482] [15322]     0 15322   175476     9736     148        0             0 salt-master
[511250.459493] [15350]     0 15350   175476     9745     148        0             0 salt-master
[511250.459495] [15366]     0 15366   175477     9743     148        0             0 salt-master
[511250.459497] [15380]     0 15380   175506     9745     148        0             0 salt-master
[511250.459498] [15399]     0 15399   175754     9769     149        0             0 salt-master
[511250.459500] [15407]     0 15407   175479     9747     148        0             0 salt-master
[511250.459501] [15447]     0 15447   175479     9742     148        0             0 salt-master
[511250.459503] [15450]     0 15450   175479     9751     148        0             0 salt-master
[511250.459504] [15454]     0 15454   175481     9747     148        0             0 salt-master
[511250.459506] [15462]     0 15462   175480     9748     148        0             0 salt-master
[511250.459508] [23316]  1000 23316  3085650    27853     144        0             0 java
[511250.459509] [23319]  1000 23319  3085650    27289     144        0             0 java
[511250.459511] [23348]  1000 23348  3085650    27778     142        0             0 java
[511250.459512] [23351]  1000 23351  3085650    26840     141        0             0 java
[511250.459514] [23373]  1000 23373  3085650    27380     143        0             0 java
[511250.459515] [23406]  1000 23406  3085650    26933     143        0             0 java
[511250.459517] [23425]  1000 23425  3085650    27371     142        0             0 java
[511250.459518] [23445]  1000 23445  3085650    27861     141        0             0 java
[511250.459520] [23476]  1000 23476  3085650    27716     143        0             0 java
[511250.459522] [23497]  1000 23497  3085650    27902     144        0             0 java
[511250.459523] [23690]  1000 23690  2049475   328916     865        0             0 java
[511250.459525] [23691]  1000 23691  2082756   356868     894        0             0 java
[511250.459527] [23693]  1000 23693  2027460   612751    1357        0             0 java
[511250.459528] [23712]  1000 23712  2027460   610571    1348        0             0 java
[511250.459529] [23754]  1000 23754  2049474   337457     886        0             0 java
[511250.459531] [23785]  1000 23785  2049474   330831     864        0             0 java
[511250.459533] [23805]  1000 23805  2027460   615907    1366        0             0 java
[511250.459534] [23828]  1000 23828  2027460   610191    1346        0             0 java
[511250.459536] [23855]  1000 23855  2629446   589971    1351        0             0 java
[511250.459537] [23860]  1000 23860  2328022   144465     519        0             0 java
[511250.459539] [13536]  1004 13536   134981     2523     108        0             0 php-fpm
[511250.459540] [ 1813]     0  1813  1481817    46140     246        0             0 java
[511250.459542] [ 3187]     0  3187  1481817    53461     253        0             0 java
[511250.459544] [ 2993]    26  2993    59779     1712      55        0             0 postgres
[511250.459546] [ 3059]  1000  3059  3085528    16411     141        0             0 java
[511250.459547] [ 3146]  1000  3146  2027460   211779     628        0             0 java
[511250.459549] [17982]   996 17982  4950828   635077    1629        0             0 java
[511250.459551] [16433]     0 16433    37607      360      74        0             0 sshd
[511250.459553] [16436]     0 16436    29390      141      13        0             0 bash
[511250.459554] [16466]     0 16466    29390      136      14        0             0 bash
[511250.459556] [22511]     0 22511    36968      433      72        0             0 sshd
[511250.459558] [22515]     0 22515    19016      257      40        0             0 ssh
[511250.459560] [22519]     0 22519    19107      350      39        0             0 ssh
[511250.459562] [22522]     0 22522    19016      259      38        0             0 ssh
[511250.459563] [24770]     0 24770    38342      657      30        0             0 vim
[511250.459565] [24781]     0 24781    45009      303      41        0             0 crond
[511250.459566] [24784]     0 24784    91360     8641     134        0             0 python
[511250.459568] [24932]     0 24932    28791       45      13        0             0 sh
[511250.459570] [24933]     0 24933    93538     7284     104        0             0 ansible-playboo
[511250.459571] [24942]     0 24942    94424     7584     103        0             0 ansible-playboo
[511250.459573] [24943]     0 24943    96455     9707     107        0             0 ansible-playboo
[511250.459574] [24944]     0 24944    94436     7599     103        0             0 ansible-playboo
[511250.459576] [24945]     0 24945    16336       70      33        0             0 ssh
[511250.459578] [24946]     0 24946    16336       71      33        0             0 ssh
[511250.459579] [24947]     0 24947    16336       69      30        0             0 ssh
[511250.459581] Out of memory: Kill process 17982 (java) score 77 or sacrifice child
[511250.459642] Killed process 17982 (java) total-vm:19803312kB, anon-rss:2540308kB, file-rss:0kB, shmem-rss:0kB

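(1) mysqld invoked the OOM killer: it needed a page (order=0, i.e. a single page, requested in the page fault path shown in the call trace) and the kernel could not find free memory for it. The /proc/sys/vm/min_free_kbytes parameter sets the min watermark, from which the low and high watermarks are derived. When free memory (not counting buffers and cache) drops below the low watermark, the kernel wakes the kswapd thread to reclaim memory. Since the OOM killer was invoked anyway, either memory was truly exhausted, or the OOM killer was triggered before or during reclaim.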

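(2) The following output shows the state of the three memory zones on node 0 (DMA, DMA32 and Normal). None of them had enough free memory to satisfy the request: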

[511250.459112] Node 0 DMA free:15892kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[511250.459117] lowmem_reserve[]: 0 2814 31994 31994
[511250.459120] Node 0 DMA32 free:119704kB min:5940kB low:7424kB high:8908kB active_anon:2678512kB inactive_anon:276kB active_file:124kB inactive_file:132kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129216kB managed:2883436kB mlocked:0kB dirty:0kB writeback:0kB mapped:1100kB shmem:1632kB slab_reclaimable:48796kB slab_unreclaimable:9340kB kernel_stack:5248kB pagetables:11424kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:32902 all_unreclaimable? yes
[511250.459124] lowmem_reserve[]: 0 0 29180 29180
[511250.459127] Node 0 Normal free:63896kB min:61608kB low:77008kB high:92412kB active_anon:29011996kB inactive_anon:6336kB active_file:6576kB inactive_file:11148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:30408704kB managed:29881068kB mlocked:0kB dirty:44kB writeback:8kB mapped:13328kB shmem:16776kB slab_reclaimable:198472kB slab_unreclaimable:94604kB kernel_stack:53472kB pagetables:159076kB unstable:0kB bounce:0kB free_pcp:656kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:924 all_unreclaimable? no
[511250.459131] lowmem_reserve[]: 0 0 0 0
[511250.459134] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15892kB
[511250.459144] Node 0 DMA32: 9372*4kB (UEM) 2427*8kB (UEM) 1179*16kB (UEM) 369*32kB (UEM) 104*64kB (EM) 31*128kB (EM) 14*256kB (UEM) 9*512kB (UEM) 7*1024kB (UEM) 3*2048kB (M) 0*4096kB = 119704kB
[511250.459154] Node 0 Normal: 1540*4kB (UE) 6148*8kB (UE) 503*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 63392kB
[511250.459162] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[511250.459163] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[511250.459164] 9275 total pagecache pages
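To make the zone figures above easier to read, here is a rough Python sketch that compares each zone's free memory with its watermarks plus lowmem_reserve. It is a simplification of the kernel's watermark check, not a reimplementation of it. The assumptions are a 4 kB page size, lowmem_reserve values expressed in pages, and a request that targets the Normal zone (index 2 in the lowmem_reserve array). The names ZONES, passes_watermark and zone_report are made up for this example.

```python
PAGE_KB = 4  # assumed page size in kB

# name -> (free_kb, min_kb, low_kb, high_kb, lowmem_reserve in pages),
# copied from the log excerpt above
ZONES = {
    "DMA": (15892, 32, 40, 48, [0, 2814, 31994, 31994]),
    "DMA32": (119704, 5940, 7424, 8908, [0, 0, 29180, 29180]),
    "Normal": (63896, 61608, 77008, 92412, [0, 0, 0, 0]),
}

def passes_watermark(free_kb, mark_kb, reserve_pages):
    """Simplified check: free memory must exceed watermark + lowmem_reserve."""
    return free_kb > mark_kb + reserve_pages * PAGE_KB

def zone_report(zones, classzone_idx=2):
    """Return zone name -> which of the min/low/high checks pass."""
    report = {}
    for name, (free_kb, min_kb, low_kb, high_kb, reserve) in zones.items():
        reserve_pages = reserve[classzone_idx]
        report[name] = {
            "min": passes_watermark(free_kb, min_kb, reserve_pages),
            "low": passes_watermark(free_kb, low_kb, reserve_pages),
            "high": passes_watermark(free_kb, high_kb, reserve_pages),
        }
    return report

for zone, result in zone_report(ZONES).items():
    print(zone, result)
```

Under these assumptions, DMA and DMA32 fail even the min check once lowmem_reserve is counted, while Normal sits just above min but below low and high.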

(3) Information about the killed process

The following output confirms that the killed process was PID 17982:

[511250.459581] Out of memory: Kill process 17982 (java) score 77 or sacrifice child
[511250.459642] Killed process 17982 (java) total-vm:19803312kB, anon-rss:2540308kB, file-rss:0kB, shmem-rss:0kB

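The process table row for PID 17982 shows an rss of 635077 pages, i.e. 635077 * 4096 bytes ≈ 2.4 GiB (matching the anon-rss:2540308kB reported above):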

[511250.459172] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[511250.459549] [17982]   996 17982  4950828   635077    1629        0             0 java

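The columns mean:

  • pid: process ID.

  • uid: user ID.

  • tgid: thread group ID.

  • total_vm: virtual memory size (in 4 kB pages).

  • rss: resident memory in use (in 4 kB pages).

  • nr_ptes: number of page table entries.

  • swapents: number of swap entries.

  • oom_score_adj: usually 0; a lower value makes the process less likely to be killed when the OOM killer is invoked.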
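The column layout can be turned into a small parser. The sketch below is one possible way to do it in Python, assuming the row format shown in this post's log and a 4 kB page size; ProcRow and parse_proc_row are names invented for this example.

```python
from collections import namedtuple

PAGE_SIZE = 4096  # bytes per page, assuming 4 kB pages

ProcRow = namedtuple(
    "ProcRow", "pid uid tgid total_vm rss nr_ptes swapents oom_score_adj name"
)

def parse_proc_row(line):
    """Parse one row of the OOM process table into a ProcRow."""
    # Drop the "[timestamp] " prefix, then separate "[ pid ]" from the rest.
    body = line.split("] ", 1)[1]
    pid_part, rest = body.split("]", 1)
    pid = int(pid_part.strip("[ "))
    fields = rest.split()
    # uid, tgid, total_vm, rss, nr_ptes, swapents, oom_score_adj
    numbers = [int(value) for value in fields[:7]]
    return ProcRow(pid, *numbers, " ".join(fields[7:]))

line = "[511250.459549] [17982]   996 17982  4950828   635077    1629        0             0 java"
row = parse_proc_row(line)
rss_bytes = row.rss * PAGE_SIZE
print(row.name, row.pid, rss_bytes, round(rss_bytes / 2**30, 2), "GiB")
```

For PID 17982 this gives 2601275392 bytes, about 2.42 GiB.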

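(4) Analyzing the rss of all processes (rss is the physical memory a process actually uses, in 4 kB pages)

Adding up the rss of every process in the OOM output gives about 32 GB, which is all of the machine's physical memory (the log reports 8388478 pages of RAM, i.e. about 32 GiB). So the OOM killer was triggered because memory really was exhausted. The java processes together account for about 26 GB, by far the largest share.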
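The summing step can be sketched in Python as follows. This is only an illustration: it assumes the process table format shown in the log above and 4 kB pages, and the sample contains just four rows copied from that log, so its totals are not the full 32 GB / 26 GB figures. ROW_RE and rss_by_name are names made up for this example.

```python
import re
from collections import defaultdict

PAGE_KB = 4  # assumed page size in kB

# [pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
ROW_RE = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+\d+\s+\d+\s+\d+\s+(?P<rss>\d+)"
    r"\s+\d+\s+\d+\s+-?\d+\s+(?P<name>\S+)\s*$"
)

def rss_by_name(lines):
    """Sum rss (converted to kB) per process name over OOM process table rows."""
    totals = defaultdict(int)
    for line in lines:
        match = ROW_RE.search(line)
        if match:
            totals[match["name"]] += int(match["rss"]) * PAGE_KB
    return dict(totals)

sample = [
    "[511250.459172] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name",
    "[511250.459230] [ 9897]     0  9897  3062553   270245     774        0             0 java",
    "[511250.459246] [12499]  1001 12499  1448783    99046     328        0             0 java",
    "[511250.459261] [30051]    27 30051   322930    39761     133        0             0 mysqld",
    "[511250.459296] [14446]     0 14446    31082      540      56        0             0 nginx",
]
totals = rss_by_name(sample)
for name, kb in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} {kb:>10d} kB")
```

Note that rss counts shared pages once per process, so a sum over many forked workers can overstate real usage.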

 

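III. Solution

What we did:

1. Capped the max heap of the java processes and reduced the number of workers of the java programs, to lower memory usage.

2. The system had no swap enabled, so we added 8 GB of swap space.

Another option (not recommended) is to disallow memory overcommit: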

# echo "2" > /proc/sys/vm/overcommit_memory

# echo "80" > /proc/sys/vm/overcommit_ratio

 

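IV. Further background

1. overcommit_memory (/proc/sys/vm/overcommit_memory)

Linux allows memory overcommit: whenever a process asks for memory the kernel grants it, betting that the process will not actually use all of it. But what if it does? Then something like a bank run occurs: there is not enough cash (memory) to go around. Linux handles this crisis with the OOM killer (OOM = out-of-memory), which picks a process and kills it to free some memory, and keeps killing if that is still not enough. Alternatively, the kernel parameter vm.panic_on_oom can be set so that the kernel panics when OOM occurs (the machine then reboots automatically only if kernel.panic is set to a nonzero timeout). Both mechanisms are risky: a reboot can interrupt service, and so can killing a process. For this reason, Linux 2.6 and later allow memory overcommit to be disabled through the kernel parameter vm.overcommit_memory.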

 

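(1) The kernel parameter vm.overcommit_memory accepts three values:

0 - Heuristic overcommit handling. This is the default. Overcommit is allowed, but blatant overcommit is refused, for example a single malloc request larger than the system's total memory. "Heuristic" means the kernel uses an algorithm to guess whether a request is reasonable and refuses it if it is not.

1 - Always overcommit. Overcommit is allowed and no request is turned away; the kernel performs no overcommit handling. This setting increases the chance of memory overload but can improve performance for memory-intensive tasks.

2 - Don’t overcommit. Overcommit is forbidden. The kernel refuses any request that would bring the total committed memory to or above the sum of total swap and the share of physical RAM given by overcommit_ratio. This is the best setting if you want to reduce the risk of memory overcommit.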

 

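(2) The heuristic overcommit algorithm is implemented inside the kernel (the function is not reproduced here); it can be roughly understood as follows:

A single request may not exceed [free memory + free swap + page cache size + the reclaimable part of SLAB]; otherwise the request fails.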
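As a rough illustration of that rule, the check can be written as a single comparison. This is a simplified sketch of the description above, not the kernel's code; the function name, the parameter names and the example figures are all hypothetical.

```python
def heuristic_allows(request_kb, free_kb, free_swap_kb, pagecache_kb, slab_reclaimable_kb):
    """Return True if a single request fits under the heuristic limit.

    Simplified model: limit = free memory + free swap + page cache
    + reclaimable slab, all in kB.
    """
    limit_kb = free_kb + free_swap_kb + pagecache_kb + slab_reclaimable_kb
    return request_kb <= limit_kb

if __name__ == "__main__":
    # Hypothetical machine: 1 GiB free, no swap, 2 GiB page cache, 256 MiB slab.
    print(heuristic_allows(3 * 1024 * 1024, 1048576, 0, 2097152, 262144))
    print(heuristic_allows(4 * 1024 * 1024, 1048576, 0, 2097152, 262144))
```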

 

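(3) About disabling overcommit (vm.overcommit_memory=2): what exactly counts as overcommit? The kernel keeps a threshold, and overcommit means the total amount of requested memory exceeds it. The threshold is visible in /proc/meminfo: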

# grep -i commit /proc/meminfo
CommitLimit:     5967744 kB
Committed_AS:    5363236 kB

 

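CommitLimit is the overcommit threshold: if the total requested memory exceeds CommitLimit, that is overcommit.

How is the threshold computed? It is neither the size of physical memory nor the amount of free memory; it is set indirectly through the kernel parameter vm.overcommit_ratio or vm.overcommit_kbytes, using this formula:
[CommitLimit = (Physical RAM * vm.overcommit_ratio / 100) + Swap]

Notes:
vm.overcommit_ratio is a kernel parameter whose default is 50, meaning 50% of physical memory. If you prefer not to use a ratio, you can specify an absolute size in kilobytes through another kernel parameter, vm.overcommit_kbytes;
If huge pages are in use, they must be subtracted from physical memory, and the formula becomes:
CommitLimit = ([total RAM] - [total huge TLB RAM]) * vm.overcommit_ratio / 100 + swap
See https://access.redhat.com/solutions/665023

Committed_AS in /proc/meminfo is the total amount of memory that all processes have requested (note: requested, not actually allocated). If Committed_AS exceeds CommitLimit, overcommit has occurred, and the larger the excess the more severe it is. Put another way, Committed_AS is how much physical memory would be needed to absolutely guarantee that OOM (out of memory) never happens.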
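The CommitLimit formula is easy to check by hand. The sketch below works in kB with integer arithmetic; the function name and the machine sizes in the example are hypothetical, and the kernel itself works in pages, so results may differ slightly from /proc/meminfo.

```python
def commit_limit_kb(total_ram_kb, swap_kb, overcommit_ratio=50, hugetlb_kb=0):
    """CommitLimit = (total RAM - huge TLB RAM) * overcommit_ratio / 100 + swap."""
    return (total_ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb

def is_overcommitted(committed_as_kb, limit_kb):
    """True when Committed_AS exceeds CommitLimit."""
    return committed_as_kb > limit_kb

if __name__ == "__main__":
    # Hypothetical machine: 8 GiB RAM, 2 GiB swap, default ratio of 50.
    limit = commit_limit_kb(8388608, 2097152)
    print(limit)  # kB
    # The two values from the /proc/meminfo example in the article.
    print(is_overcommitted(5363236, 5967744))
```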

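(4) "sar -r" is a common tool for checking memory usage. Two of its output columns relate to overcommit, kbcommit and %commit:
kbcommit corresponds to Committed_AS in /proc/meminfo;
%commit does not use CommitLimit as the denominator; it is Committed_AS / (MemTotal + SwapTotal), i.e. requested memory as a percentage of physical memory plus swap.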

# sar -r 
05:00:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
05:10:01 PM    160576   3648460     95.78         0   1846212   4939368     62.74   1390292   1854880         4
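For comparison, the %commit calculation can be sketched as follows. The function name and the figures are hypothetical; the sar sample above does not show MemTotal and SwapTotal, so they are not reused here.

```python
def commit_percent(committed_as_kb, mem_total_kb, swap_total_kb):
    """%commit as described above: Committed_AS / (MemTotal + SwapTotal) * 100."""
    return round(committed_as_kb * 100 / (mem_total_kb + swap_total_kb), 2)

if __name__ == "__main__":
    # Hypothetical figures: 5,000,000 kB committed, 4,000,000 kB RAM, 4,000,000 kB swap.
    print(commit_percent(5_000_000, 4_000_000, 4_000_000))
```

Because the denominator includes all of RAM and swap rather than CommitLimit, %commit can stay below 100 even when Committed_AS already exceeds CommitLimit.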

 

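2. panic_on_oom (/proc/sys/vm/panic_on_oom)

Determines what the system does when OOM occurs. It accepts three values:

0 - The default. When OOM occurs, the OOM killer is triggered.

1 - If the OOM happens under a cpuset, memory policy or memcg constraint, the kernel may start the OOM killer instead of panicking. In all other cases it triggers a kernel panic (the system then reboots only if kernel.panic is set to a nonzero timeout).

2 - Whenever OOM occurs, trigger a kernel panic directly (again, a reboot follows only if kernel.panic is set).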

 

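3. oom_adj, oom_score_adj and oom_score

Strictly speaking, these parameters are per-process, so they live under /proc/xxx/ (xxx is the process ID). Suppose we choose to kill a process when OOM occurs; the natural question is which one. The kernel's algorithm is simple: score each process (oom_score, which is read-only) and pick the highest. How is the score computed? See the kernel's oom_badness function: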

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
              const nodemask_t *nodemask, unsigned long totalpages)
{
    /* …… (earlier part of the function omitted in this excerpt) */
    adj = (long)p->signal->oom_score_adj;
    if (adj == OOM_SCORE_ADJ_MIN) {                    /* (1) */
        task_unlock(p);
        return 0;                                      /* (2) */
    }
    points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
        atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);  /* (3) */
    task_unlock(p);

    if (has_capability_noaudit(p, CAP_SYS_ADMIN))      /* (4) */
        points -= (points * 3) / 100;
    adj *= totalpages / 1000;                          /* (5) */
    points += adj;

    return points > 0 ? points : 1;
}

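(1) The score of a task (oom_score) has two parts: a system part, based mainly on the task's memory usage, and a user part, namely oom_score_adj. The final score combines both. If the user sets the task's oom_score_adj to OOM_SCORE_ADJ_MIN (-1000), the OOM killer is effectively forbidden from killing that process.

(2) Returning 0 here tells the OOM killer that this is a "good process" and must not be killed. As shown further down, the lowest value an actually computed score can have is 1.

(3) As mentioned, the system part reflects physical memory consumption and has three main components: RSS, the memory occupied on the swap file or swap device, and the memory occupied by page tables.

(4) A root process gets a 3% allowance, so that share is subtracted here.

(5) The user can adjust oom_score. How? oom_score_adj ranges from -1000 to 1000. 0 means no adjustment; a negative value deducts a discount from the computed score; a positive value penalizes the task, i.e. raises its oom_score. In practice the adjustment is scaled by the memory allocatable for the current allocation (all available memory in the system if there is no allocation constraint; if cpusets are in use, the actual quota of that cpuset). oom_badness receives this as the totalpages argument, the upper limit of allocatable memory at that moment. The actual score (points) is adjusted according to oom_score_adj: for example, oom_score_adj = -500 means a 50% discount (relative to totalpages), i.e. the task's actual memory usage is reduced by half of the allocatable limit.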
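To make steps (1) to (5) concrete, here is a small Python re-implementation of the scoring logic in the excerpt. It is a sketch for experimenting with numbers, not kernel code: the parameter names are made up, capability checking is reduced to an is_root flag, and the totalpages value in the example is a hypothetical figure.

```python
OOM_SCORE_ADJ_MIN = -1000

def oom_badness(rss, swapents, nr_ptes, oom_score_adj, totalpages,
                is_root=False, nr_pmds=0):
    """Mirror the arithmetic of the oom_badness excerpt (all sizes in pages)."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return 0                                  # (1)/(2): never kill this task
    points = rss + swapents + nr_ptes + nr_pmds   # (3): memory footprint
    if is_root:
        points -= (points * 3) // 100             # (4): 3% allowance for root
    points += oom_score_adj * (totalpages // 1000)  # (5): user adjustment
    return points if points > 0 else 1

if __name__ == "__main__":
    # rss and nr_ptes of process 17982 from the dump; totalpages is hypothetical.
    total = 8_000_000
    for adj in (0, 100, -500, -1000):
        print(adj, oom_badness(635077, 0, 1629, adj, total))
```

With these inputs an adjustment of -500 subtracts 4,000,000 pages, far more than the task uses, so the score bottoms out at 1 while -1000 returns 0.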

 

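With oom_score_adj and oom_score understood, the picture is essentially complete. oom_adj is an older interface with a similar function to oom_score_adj; it is kept for compatibility, and when it is written the kernel converts the value to oom_score_adj. Interested readers can look into the details; they are not covered here.

Additionally:

Any process spawned by an adjusted process inherits that process's oom_score_adj. For example, if the sshd process is exempted from the OOM killer, every process spawned from SSH sessions is exempted too. This can limit the OOM killer's ability to rescue the system when OOM occurs.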

 

4. min_free_kbytes (/proc/sys/vm/min_free_kbytes)

First, the official explanation:

This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a watermark[WMARK_MIN] value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

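The explanation is clear enough; the key points are:

(1) It is the lower limit of free memory that the system keeps in reserve.

At system initialization a default is computed from the memory size, using the rule:

  min_free_kbytes = sqrt(lowmem_kbytes * 16) = 4 * sqrt(lowmem_kbytes) (note: lowmem_kbytes can be regarded as the system memory size)

The computed value is also clamped: the minimum is 128 kB and the maximum is 64 MB (65536 kB).

So min_free_kbytes does not grow linearly with memory size. The comments in the kernel source give the reason: "because network bandwidth does not increase linearly with machine size". As memory grows there is no need to reserve proportionally more; it is enough to cover what is needed in an emergency.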
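The default rule and its clamping can be tried out with a short sketch. The function name is hypothetical, and using an integer square root is an assumption about rounding; a tuned or distribution-adjusted system may show a different value in /proc/sys/vm/min_free_kbytes.

```python
import math

MIN_KB = 128      # lower clamp stated in the article
MAX_KB = 65536    # upper clamp stated in the article (64 MB)

def default_min_free_kbytes(lowmem_kbytes):
    """min_free_kbytes = sqrt(lowmem_kbytes * 16), clamped to [128 kB, 64 MB]."""
    value = math.isqrt(lowmem_kbytes * 16)
    return max(MIN_KB, min(value, MAX_KB))

if __name__ == "__main__":
    for gib in (1, 16, 64, 1024):
        kb = gib * 1024 * 1024
        print(f"{gib:5d} GiB -> {default_min_free_kbytes(kb)} kB")
```

Going from 1 GiB to 16 GiB multiplies memory by 16 but the reserve only by 4, which is the square-root growth described above.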

 

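(2) The main use of min_free_kbytes is to compute the three parameters that drive memory reclaim: watermark[min/low/high]

1) watermark[high] > watermark[low] > watermark[min], and each zone has its own set.

2) When free memory falls below watermark[low], the kernel thread kswapd starts reclaiming memory (there is one kswapd per node) and stops once the zone's free memory reaches watermark[high]. If allocation requests arrive so fast that free memory drops to watermark[min], the kernel performs direct reclaim, i.e. it reclaims in the context of the allocating process and then uses the reclaimed pages to satisfy the request. This blocks the application, adds latency, and may trigger OOM. The memory below watermark[min] is the system's own reserve for special uses and is not handed out to ordinary user-space requests.

3) How the three watermarks are computed:

 watermark[min] = min_free_kbytes converted to pages; call it min_free_pages. (Because each zone has its own set of watermarks, the effective value is a per-zone min_free_pages, computed in proportion to the zone's share of total memory.)
 watermark[low] = watermark[min] * 5 / 4
 watermark[high] = watermark[min] * 3 / 2

So the buffer between them is high - low = low - min = per_zone_min_free_pages * 1/4. Since min_free_kbytes = 4 * sqrt(lowmem_kbytes), the buffer also grows with the square root of memory size.

4) Each zone's watermarks can be viewed in /proc/zoneinfo

For example: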

Node 0, zone      DMA
pages free     3960
       min      65
       low      81
       high     97
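The watermark formulas can be checked against real numbers. The sketch below assumes integer rounding of the form min + min/4 and min + min/2 (equivalent to the 5/4 and 3/2 factors up to truncation); the function names are hypothetical. It reproduces the 65/81/97 example above and the Node 0 Normal line of the OOM dump (min:61608kB low:77008kB high:92412kB, i.e. 15402/19252/23103 pages).

```python
def zone_watermarks(min_pages):
    """Return (min, low, high) in pages for a zone's watermark[min]."""
    low = min_pages + min_pages // 4    # watermark[min] * 5 / 4
    high = min_pages + min_pages // 2   # watermark[min] * 3 / 2
    return min_pages, low, high

def per_zone_min_pages(min_free_kbytes, zone_pages, total_pages, page_kb=4):
    """Split min_free_kbytes across zones in proportion to zone size."""
    return (min_free_kbytes // page_kb) * zone_pages // total_pages

if __name__ == "__main__":
    print(zone_watermarks(65))              # DMA zone from the zoneinfo example
    print(zone_watermarks(61608 // 4))      # Normal zone from the OOM dump
    # Hypothetical split: a zone holding a quarter of memory, 4096 kB reserve.
    print(per_zone_min_pages(4096, 250_000, 1_000_000))
```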

 

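(3) Effect of the min_free_kbytes value

The larger min_free_kbytes is, the higher the watermarks and the larger the buffers between them. kswapd then starts reclaiming earlier and reclaims more (it stops only at watermark[high]), so the system keeps more memory free, which to some extent reduces the memory available to applications. In the extreme case where min_free_kbytes approaches the size of memory, too little is left for applications and OOM may occur frequently.

If min_free_kbytes is too small, the system's reserve is too small. kswapd itself allocates a little memory while reclaiming (with the PF_MEMALLOC flag set, which lets it use the reserve); likewise, a process chosen and killed by the OOM killer may use the reserve if it needs memory while exiting. Letting both use the reserve keeps the system from deadlocking.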

 

5. lowmem and highmem

The definitions of lowmem and highmem are not covered in detail here; recommended reading:

http://ilinuxkernel.com/?p=1013

 

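The linked article explains this clearly, so only the conclusions are given here:

(1) The highmem concept is needed only when physical memory is larger than the kernel's address space.

On 32-bit x86, Linux by default splits a process's 4 GB virtual address space 3:1: 0-3 GB is user space, mapped through page tables, and 3-4 GB is kernel space, linearly mapped at the top of the process's address space. In other words, an x86 machine with more than about 1 GB of physical memory needs highmem.

(2) The kernel cannot directly access physical memory above 1 GB (that memory cannot be mapped into the kernel's address space). To access it, the kernel temporarily maps the high physical memory into an address range it can reach.

(3) Once lowmem is exhausted, OOM occurs even if plenty of physical memory remains. On Linux 2.6 the kernel then kills a process chosen by score (OOM itself is not discussed further here).

(4) On x86_64 the kernel's address space is far larger than actual physical memory, so the highmem issue above does not currently arise; system memory can be treated as all lowmem.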

 

6. lowmem_reserve_ratio (/proc/sys/vm/lowmem_reserve_ratio)

The official explanation:

For some specialised workloads on highmem machines it is dangerous for the kernel to allow process memory to be allocated from the "lowmem" zone. This is because that memory could then be pinned via the mlock() system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory can be fatal.

So the Linux page allocator has a mechanism which prevents allocations which _could_ use highmem from using too much lowmem. This means that a certain amount of lowmem is defended from the possibility of being captured into pinned user memory.

The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your applications are using mlock(), or if you are running with no swap then you probably should change the lowmem_reserve_ratio setting.

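(1) Purpose

Besides the per-zone reserve set by min_free_kbytes, lowmem_reserve_ratio provides a defensive reserve between zones, mainly to keep allocations aimed at higher zones from overusing the memory of lower zones when the higher zones run short.

For example, a typical single-node machine today has three zones: DMA, DMA32 and NORMAL. DMA and DMA32 are the low zones and are small (on a machine with 96 GB of memory the two total only about 1 GB), while NORMAL counts as the high zone (a HIGH zone is generally absent nowadays) and is large (>90 GB). Low memory has special uses, for example DMA can only be served from the low memory of the DMA zone, so high memory should be used whenever possible instead of low memory, and scarce low memory must be protected from being grabbed when high memory runs short.

 

(2) Calculation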

# cat /proc/sys/vm/lowmem_reserve_ratio

256     256     32

 

The kernel uses the ratio array above to compute the number of pages each zone reserves. The result is also an array (protection), which can be seen in /proc/zoneinfo:

Node 0, zone      DMA
 pages free     1355
       min      3
       low      3
       high     4
       :
       :
   numa_other   0
       protection: (0, 2004, 2004, 2004)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 pagesets
   cpu: 0 pcp: 0
       :

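During allocation, these reserve values are added to the watermark to decide whether the request can be satisfied or whether free memory is considered too low and reclaim must start.

For example, if an allocation that targets the normal zone (index = 2) tries to take memory from the DMA zone and the check in use is watermark[low], the kernel sees page_free = 1355 while watermark + protection[2] = 3 + 2004 = 2007 > page_free, so it considers free memory too low and refuses the allocation. If the request targets the DMA zone itself, protection[0] = 0 is used and the allocation succeeds.

The rule for computing protection[j] of zone[i] is: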

(i < j):
 zone[i]->protection[j]
 = (total sums of present_pages from zone[i+1] to zone[j] on the node)
   / lowmem_reserve_ratio[i];
(i = j):
  (should not be protected. = 0;
(i > j):
  (not necessary, but looks 0)

The default lowmem_reserve_ratio[i] values are:

   256 (if zone[i] means DMA or DMA32 zone)

   32  (others).

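The rule shows that the reserve is inversely proportional to the ratio: 256 means 1/256, i.e. about 0.39% of the size of the higher zones. To reserve more pages, set a smaller value; the minimum is 1 (1/1 -> 100%).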
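The protection rule can be verified with a short sketch. The function name is hypothetical, and the zone sizes are the present: values of the 96 GB server log shown in part (3) of this section, converted to 4 kB pages, plus an empty fourth zone as an assumption. With the default ratios the result matches the three lowmem_reserve[] lines of that log.

```python
def lowmem_protection(present_pages, ratios):
    """Compute protection[i][j] for each zone i and allocation class j.

    For i < j: sum of present pages of zones i+1..j divided by ratios[i];
    otherwise 0.
    """
    n = len(present_pages)
    table = []
    for i in range(n):
        row = []
        for j in range(n):
            if j <= i:
                row.append(0)
            else:
                row.append(sum(present_pages[i + 1 : j + 1]) // ratios[i])
        table.append(row)
    return table

if __name__ == "__main__":
    # DMA, DMA32, Normal (kB from the log / 4) and an empty fourth zone (assumed).
    present = [15332 // 4, 1998016 // 4, 97218560 // 4, 0]
    ratios = [256, 256, 32]   # default /proc/sys/vm/lowmem_reserve_ratio
    for name, row in zip(("DMA", "DMA32", "Normal"), lowmem_protection(present, ratios)):
        print(name, row)
```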

 

(3) Example of the interplay with min_free_kbytes (watermark)

Below is the log printed when a memory allocation failed on a production server (96 GB):

[38905.295014] java: page allocation failure. order:1, mode:0x20, zone 2
[38905.295020] Pid: 25174, comm: java Not tainted 2.6.32-220.23.1.tb750.el5.x86_64 #1
...
[38905.295348] active_anon:5730961 inactive_anon:216708 isolated_anon:0
[38905.295349]  active_file:2251981 inactive_file:15562505 isolated_file:0
[38905.295350]  unevictable:1256 dirty:790255 writeback:0 unstable:0
[38905.295351]  free:113095 slab_reclaimable:577285 slab_unreclaimable:31941
[38905.295352]  mapped:7816 shmem:4 pagetables:13911 bounce:0
[38905.295355] Node 0 DMA free:15796kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB  isolated(anon):0kB isolated(file):0kB present:15332kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[38905.295365] lowmem_reserve[]: 0 1951 96891 96891
[38905.295369] Node 0 DMA32 free:380032kB min:800kB low:1000kB high:1200kB active_anon:46056kB inactive_anon:10876kB active_file:15968kB inactive_file:129772kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1998016kB mlocked:0kB dirty:20416kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:11716kB slab_unreclaimable:160kB kernel_stack:176kB pagetables:112kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:576 all_unreclaimable? no
[38905.295379] lowmem_reserve[]: 0 0 94940 94940
[38905.295383] Node 0 Normal free:56552kB min:39032kB low:48788kB high:58548kB active_anon:22877788kB inactive_anon:855956kB active_file:8991956kB inactive_file:62120248kB unevictable:5024kB isolated(anon):0kB isolated(file):0kB present:97218560kB mlocked:5024kB dirty:3140604kB writeback:0kB mapped:31264kB shmem:16kB slab_reclaimable:2297424kB slab_unreclaimable:127604kB kernel_stack:12528kB pagetables:55532kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[38905.295393] lowmem_reserve[]: 0 0 0 0
[38905.295396] Node 0 DMA: 1*4kB 2*8kB 0*16kB 1*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15796kB
[38905.295405] Node 0 DMA32: 130*4kB 65*8kB 75*16kB 72*32kB 95*64kB 22*128kB 10*256kB 7*512kB 4*1024kB 2*2048kB 86*4096kB = 380032kB
[38905.295414] Node 0 Normal: 12544*4kB 68*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 54816kB
[38905.295423] 17816926 total pagecache pages

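1) The first log line, "order:1, mode:0x20", shows a GFP_ATOMIC request with order = 1 (i.e. 2 contiguous pages).

2) First allocation attempt

In __alloc_pages_nodemask(), get_page_from_freelist() is first called for the initial attempt with the flags ALLOC_WMARK_LOW|ALLOC_CPUSET. It runs zone_watermark_ok() on every zone, using the watermark[low] threshold passed in.

zone_watermark_ok() takes z->lowmem_reserve[] into account, so a request aimed at normal does not fall through to the low zones. For DMA32, for example:

free pages = 380032 kB = 95008 pages < low (1000 kB = 250 pages) + lowmem_reserve[normal] (94940) = 95190

So DMA32 is judged not OK, and by the same reasoning DMA memory cannot be used either.

For the normal zone itself, free pages = 56552 kB = 14138 pages and lowmem_reserve is 0, but the request order (1) also matters: after subtracting the 12544 order-0 pages only 14138 - 12544 = 1594 remain, which is below low / 2 = (48788 kB = 12197 pages) / 2 = 6098 pages.

So the first attempt fails and the allocator enters __alloc_pages_slowpath() for a more aggressive attempt.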

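3) Second allocation attempt

__alloc_pages_slowpath() first adjusts alloc_flags via gfp_to_alloc_flags(), setting more forceful flags. For the original GFP_ATOMIC request it sets ALLOC_WMARK_MIN | ALLOC_HARDER | ALLOC_HIGH, but notably not ALLOC_NO_WATERMARKS. That flag skips the zone watermark checks altogether; it marks the highest-priority kind of request and may use all reserved memory, but only when (!in_interrupt() && ((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))), i.e. not in interrupt context and the task is either reclaiming (e.g. kswapd) or exiting.

It then calls get_page_from_freelist() again with the new alloc_flags for a second attempt. Despite ALLOC_HARDER and ALLOC_HIGH, the zone_watermark_ok check unfortunately still fails for all three zones. For DMA32, for example:

free pages = 380032 kB = 95008 pages

Because ALLOC_HIGH is set, watermark[min] is halved: min = min / 2 = 800 kB / 2 = 400 kB = 100 pages

Because ALLOC_HARDER is also set, a further quarter is cut from min: min = 3 * min / 4 = 100 pages * 3 / 4 = 75 pages

Even so, min (75 pages) + lowmem_reserve[normal] (94940) = 95015 is still greater than free pages, so the allocation is still refused. DMA fails for the same reason, and in normal there are too few contiguous 8 kB blocks among the free pages to satisfy the request.

After the second failure, since ALLOC_NO_WATERMARKS is not set the allocator does not enter __alloc_pages_high_priority for a highest-priority attempt, and since a GFP_ATOMIC allocation can neither block for reclaim nor enter OOM, the request simply fails.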
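Both attempts can be replayed with a small model of the watermark check. This is a sketch modeled on the steps described above, not the kernel function itself: the parameter names are hypothetical, and the small adjustment of free pages for the request size (free - 2^order + 1) is an assumption that the walkthrough does not mention. The inputs come from the log: free pages and per-order free block counts for DMA32 and Normal, and the min/low watermarks converted to pages.

```python
def zone_watermark_ok(free_pages, nr_free_by_order, order, mark, lowmem_reserve,
                      alloc_high=False, alloc_harder=False):
    """Return True if a zone can satisfy a request of the given order."""
    min_pages = mark
    free = free_pages - (1 << order) + 1      # assumed request-size adjustment
    if alloc_high:
        min_pages -= min_pages // 2           # ALLOC_HIGH: halve the watermark
    if alloc_harder:
        min_pages -= min_pages // 4           # ALLOC_HARDER: cut another quarter
    if free <= min_pages + lowmem_reserve:
        return False
    for o in range(order):
        free -= nr_free_by_order[o] << o      # lower-order blocks cannot be used
        min_pages >>= 1                       # require fewer high-order pages
        if free <= min_pages:
            return False
    return True

# Values from the log, in 4 kB pages; lists are free block counts per order.
DMA32 = dict(free_pages=95008, nr_free_by_order=[130, 65, 75], lowmem_reserve=94940)
NORMAL = dict(free_pages=14138, nr_free_by_order=[12544, 68, 0], lowmem_reserve=0)

if __name__ == "__main__":
    # First attempt: watermark[low]; second: watermark[min] with both flags.
    print(zone_watermark_ok(order=1, mark=250, **DMA32))
    print(zone_watermark_ok(order=1, mark=12197, **NORMAL))
    print(zone_watermark_ok(order=1, mark=200, alloc_high=True, alloc_harder=True, **DMA32))
    print(zone_watermark_ok(order=1, mark=9758, alloc_high=True, alloc_harder=True, **NORMAL))
```

All four checks fail with these inputs, which agrees with the outcome of the walkthrough: DMA32 is held back by its reserve against normal-zone requests, and the normal zone has almost no free blocks beyond order 0.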

 

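In this situation you can raise min_free_kbytes moderately so that kswapd starts reclaiming earlier and the system always keeps more memory free. Optionally you can also reduce the lowmem reserve moderately (by raising the lowmem_reserve_ratio values, since the reserve is inversely proportional to the ratio) so that when memory runs short (mainly in the normal zone) the DMA32/DMA zones can be borrowed in an emergency; take care not to make the reserve too small.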

 

References:

http://kernel.taobao.org/index.php?title=Kernel_Documents/mm_sysctl

http://ilinuxkernel.com/?p=1013

https://www.kernel.org/doc/Documentation/sysctl/vm.txt
