postgres: page allocation failure. order:1, mode:0x20

Today a Greenplum segment host's postgres process was killed by the OS with signal 6. Checking the system log turned up a page allocation failure.

The server's memory state at the time is shown in the figure below.

The error messages:

Jun 11 10:03:49 P1QMSSDW10 kernel: postgres: page allocation failure. order:1, mode:0x20
Jun 11 10:03:49 P1QMSSDW10 kernel: Pid: 9234, comm: postgres Tainted: G        W  ---------------    2.6.32-504.el6.x86_64 #1
Jun 11 10:03:49 P1QMSSDW10 kernel: Call Trace:
Jun 11 10:03:49 P1QMSSDW10 kernel: <IRQ>  [<ffffffff8113438a>] ? __alloc_pages_nodemask+0x74a/0x8d0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81173332>] ? kmem_getpages+0x62/0x170
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81173f4a>] ? fallback_alloc+0x1ba/0x270
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8117399f>] ? cache_grow+0x2cf/0x320
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81173cc9>] ? ____cache_alloc_node+0x99/0x160
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81174c4b>] ? kmem_cache_alloc+0x11b/0x190
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8144c768>] ? sk_prot_alloc+0x48/0x1c0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8144d992>] ? sk_clone+0x22/0x2e0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff814a1b76>] ? inet_csk_clone+0x16/0xd0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff814bb713>] ? tcp_create_openreq_child+0x23/0x470
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff814b8ecd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff814bb4b6>] ? tcp_check_req+0x226/0x460
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff814b890b>] ? tcp_v4_do_rcv+0x35b/0x490
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81089524>] ? mod_timer+0x144/0x220
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff814ba1a2>] ? tcp_v4_rcv+0x522/0x900
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81496ded>] ? ip_local_deliver_finish+0xdd/0x2d0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81497078>] ? ip_local_deliver+0x98/0xa0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8149653d>] ? ip_rcv_finish+0x12d/0x440
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81496ac5>] ? ip_rcv+0x275/0x350
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8145c88b>] ? __netif_receive_skb+0x4ab/0x750
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81460588>] ? netif_receive_skb+0x58/0x60
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81460690>] ? napi_skb_finish+0x50/0x70
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81461f69>] ? napi_gro_receive+0x39/0x50
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffffa0153bda>] ? ixgbe_clean_rx_irq+0x26a/0xc90 [ixgbe]
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffffa0159693>] ? ixgbe_poll+0x453/0x7e0 [ixgbe]
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81462083>] ? net_rx_action+0x103/0x2f0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff810eaa90>] ? handle_IRQ_event+0x60/0x170
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8107d90f>] ? __do_softirq+0x11f/0x1e0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8107d765>] ? irq_exit+0x85/0x90
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff81533b45>] ? do_IRQ+0x75/0xf0
Jun 11 10:03:49 P1QMSSDW10 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Jun 11 10:14:22 P1QMSSDW10 abrt[56116]: Not saving repeating crash in '/opt/greenplum/greenplum-db-4.3.27.0/bin/postgres'
Jun 11 10:14:22 P1QMSSDW10 abrt[56118]: Not saving repeating crash in '/opt/greenplum/greenplum-db-4.3.27.0/bin/postgres'
Jun 11 10:14:35 P1QMSSDW10 abrt[56119]: Saved core dump of pid 56117 (/opt/greenplum/greenplum-db-4.3.27.0/bin/postgres) to /var/spool/abrt/ccpp-2020-06-11-10:14:22-56117 (1550864384 bytes)

Check the core dump reason:

 ccpp-2020-06-11-10:14:22-56117]# more reason 
Process /opt/greenplum/greenplum-db-4.3.27.0/bin/postgres was killed by signal 6 (SIGABRT)

Killed by signal 6 again; dispiriting, since the master hit the same problem last time.

I then came across this article by digoal (德哥) and read through it.

Background

Linux manages physical memory in three levels: nodes (Node), zones (Zone), and pages (Page).

Each node is further divided into several zones. On x86 a node contains ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM; on 64-bit x86 the zones are ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL.

They form a tree-like containment relationship: each node contains zones, and each zone contains pages.

The lower zones contain less memory.

On an x86 system, physical memory is divided as follows:

Type            Address range
ZONE_DMA        First 16 MB
ZONE_NORMAL     16 MB – 896 MB
ZONE_HIGHMEM    Above 896 MB

ZONE_DMA sits at the low end of memory and serves some legacy ISA devices. ZONE_NORMAL is mapped directly into the upper part of the Linux kernel's linear address space.

On an x86_64 system:

Type            Address range
ZONE_DMA        First 16 MB
ZONE_DMA32      16 MB – 4 GB
ZONE_NORMAL     Above 4 GB

Compared with x86, x86_64 gains ZONE_DMA32 and loses ZONE_HIGHMEM.

When allocating physical memory, the kernel searches from the highest zone down to the lowest for enough free pages, then maps what it finds into virtual addresses.

But the lower zones hold little memory, so once a low zone fills up, an OOM can still occur even though plenty of physical memory remains overall. On Linux 2.6 the OOM killer then kills one process based on its score (the OOM topic is not expanded on here).

This is why a protection mechanism for the lower zones was introduced: lowmem_reserve_ratio.

If the machine has a lot of memory (say 64 GB or more), the protection of the lower zones can be raised to the maximum.

For example:

sysctl -w vm.lowmem_reserve_ratio='1 1 1'

echo "vm.lowmem_reserve_ratio = 1 1 1" >> /etc/sysctl.conf

Take a machine with 192 GB of RAM running an x86_64 OS: DMA = 16 MB, DMA32 = 4 GB, NORMAL = all the rest.

After setting vm.lowmem_reserve_ratio='1 1 1', each zone's protection array looks like the following. A value at a given position means: when the corresponding higher zone falls back to this zone for an allocation, this zone's free memory must exceed that protection value.

#free -m
             total       used       free     shared    buffers     cached
Mem:        193031     190099       2931          0        288     168877
-/+ buffers/cache:      20933     172098
Swap:            0          0          0

#cat /proc/zoneinfo |grep pro
dma:        protection: (0, 498744, 49625144, 49625144)
dma32:        protection: (0, 0, 49126400, 49126400)
normal:        protection: (0, 0, 0, 0)

Since NORMAL is already the highest zone, no higher zone ever falls back to it, so its protection values are all 0 and any request may be served from it.

Take DMA: 498744 * 4 KB is the protection value against allocations that could have used DMA32; only when the DMA zone's free memory exceeds it may such an allocation fall back to DMA. Likewise, 49625144 * 4 KB is the protection value against allocations that could have used NORMAL; only when DMA's free memory exceeds it may a NORMAL-capable allocation fall back to DMA.

How are these values computed? Sum the sizes of the higher zones covered by that protection slot and divide by the protected zone's ratio. DMA's ratio is 1 (the first element of vm.lowmem_reserve_ratio), so the protection against DMA32-capable requests = DMA32 size / 1, and the protection against NORMAL-capable requests = (DMA32 size + NORMAL size) / 1, matching the vm.txt formula quoted later. The sketch after the next listing recomputes these values.

The size of each zone (unit: 4 KB pages):

#cat /proc/zoneinfo |grep span
dma:        spanned  4095
dma32:        spanned  1044480
normal:        spanned  49807360
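To make the arithmetic concrete, here is a small sketch of my own (not from the original article) that recomputes each zone's protection array from /proc/zoneinfo and vm.lowmem_reserve_ratio, following the vm.txt formula quoted further down. It assumes a single NUMA node and uses present pages; newer kernels derive protection from managed pages, so small deviations are expected.

#!/bin/bash
# Sketch: zone[i].protection[j] = sum(present pages of zone[i+1]..zone[j]) / ratio[i]
ratios=($(cat /proc/sys/vm/lowmem_reserve_ratio))

# zone name and present page count for every zone on node 0, in order
mapfile -t zones < <(awk '/^Node 0, zone/ {z=$4}
                          z && $1=="present" {print z, $2; z=""}' /proc/zoneinfo)

n=${#zones[@]}
for ((i = 0; i < n; i++)); do
    read -r name _ <<< "${zones[$i]}"
    line="$name: protection ="
    for ((j = i + 1; j < n; j++)); do
        sum=0
        for ((k = i + 1; k <= j; k++)); do
            read -r _ pages <<< "${zones[$k]}"
            sum=$((sum + pages))
        done
        line+=" $((sum / ${ratios[i]}))"
    done
    echo "$line"
done

Comparing its output with grep protection /proc/zoneinfo is a quick way to check that the ratios you set have taken effect.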

Page allocation failure caused by low-memory exhaustion

For example, one host logged errors like the following, which ended with the process being killed by the OOM killer:

https://www.postgresql.org/message-id/flat/4301.138.23.210.20.1135194176.squirrel%40www.cs.ucr.edu#[email protected]

Dec 20 17:14:57 server4 kernel:  postmaster: page allocation failure. order:0, mode:0xd0    
Dec 20 17:14:57 server4 kernel:  [<c0143271>] __alloc_pages+0x2e1/0x2f7    
Dec 20 17:14:57 server4 kernel:  [<c014329f>] __get_free_pages+0x18/0x24    
Dec 20 17:14:57 server4 kernel:  [<c0145bfc>] kmem_getpages+0x1c/0xbb    
Dec 20 17:14:57 server4 kernel:  [<c014674a>] cache_grow+0xab/0x138    
Dec 20 17:14:57 server4 kernel:  [<c014693c>] cache_alloc_refill+0x165/0x19d    
Dec 20 17:14:57 server4 kernel:  [<c0146b37>] kmem_cache_alloc+0x51/0x57    
Dec 20 17:14:57 server4 kernel:  [<c0142019>] mempool_alloc+0xb2/0x135    
Dec 20 17:14:57 server4 kernel:  [<c011fec9>] autoremove_wake_function+0x0/0x2d    
Dec 20 17:14:57 server4 kernel:  [<c011fec9>] autoremove_wake_function+0x0/0x2d    
Dec 20 17:14:57 server4 kernel:  [<c015de0e>] bio_alloc+0x15/0x168    
Dec 20 17:14:57 server4 kernel:  [<c026c48f>] sync_page_io+0x25/0xa2    
Dec 20 17:14:57 server4 kernel:  [<c026d97c>] write_disk_sb+0x5a/0x86    
Dec 20 17:15:01 server4 kernel:  [<c026d9ca>] sync_sbs+0x22/0x2f    
Dec 20 17:15:01 server4 kernel:  [<c026da5b>] md_update_sb+0x84/0xc6    
Dec 20 17:15:01 server4 kernel:  [<c02706b5>] md_write_start+0x5e/0x8c    
Dec 20 17:15:01 server4 kernel:  [<f882faf7>] make_request+0x22a/0x2b3 [raid1]    
Dec 20 17:15:01 server4 kernel:  [<c02232e4>] generic_make_request+0x18e/0x19e    
Dec 20 17:15:01 server4 kernel:  [<c02233be>] submit_bio+0xca/0xd2    
Dec 20 17:15:01 server4 kernel:  [<c0144812>] test_set_page_writeback+0xad/0xe1    
Dec 20 17:15:01 server4 kernel:  [<c0152ad7>] swap_writepage+0x9a/0xa3    
Dec 20 17:15:01 server4 kernel:  [<c01487ed>] pageout+0x8d/0xcc    
Dec 20 17:15:01 server4 kernel:  [<c0148a33>] shrink_list+0x207/0x3ed    
Dec 20 17:15:01 server4 kernel:  [<c0147cb4>] __pagevec_release+0x15/0x1d    
Dec 20 17:15:01 server4 kernel:  [<c0148df6>] shrink_cache+0x1dd/0x34d    
Dec 20 17:15:01 server4 kernel:  [<c01494b4>] shrink_zone+0xa7/0xb6    
Dec 20 17:15:01 server4 kernel:  [<c014950f>] shrink_caches+0x4c/0x57    
Dec 20 17:15:01 server4 kernel:  [<c0149606>] try_to_free_pages+0xc3/0x1a7    
Dec 20 17:15:01 server4 kernel:  [<c014318e>] __alloc_pages+0x1fe/0x2f7    
Dec 20 17:15:01 server4 kernel:  [<c014329f>] __get_free_pages+0x18/0x24    
Dec 20 17:15:01 server4 kernel:  [<c0145bfc>] kmem_getpages+0x1c/0xbb    
Dec 20 17:15:01 server4 kernel:  [<c014674a>] cache_grow+0xab/0x138    

This was very likely caused by a shortage of low-zone memory.

Some concepts (excerpted from the Internet)

http://blog.2baxb.me/archives/1065

http://blog.csdn.net/kickxxx/article/details/8835733

lowmem and highmem

The definitions of lowmem and highmem are not expanded on here; two recommended references:

Documentation/vm/highmem.txt

Linux內核高端內存 (Linux kernel high memory)

The linked articles explain this well; only the conclusions are listed here:

1. The highmem concept is only needed when the system's physical memory exceeds the kernel's address-space range.

2. On x86, Linux by default splits a process's 4 GB virtual address space 3:1: user space (0-3 GB) is mapped through page tables, while kernel space (3-4 GB) is linearly mapped at the high end of the address space. In other words, once an x86 machine has more than 1 GB of physical memory, highmem is needed.

3. The kernel cannot directly access physical memory above 1 GB (that memory cannot all be mapped into the kernel's address space). When it needs to, it sets up a temporary mapping of the high physical memory into an address range the kernel can reach.

4. Once lowmem is exhausted, an OOM can still occur even if plenty of physical memory remains. On Linux 2.6 the OOM killer then kills one process based on its score (not expanded on here).

5. On x86_64, the kernel's usable address space is far larger than physical memory, so the highmem problem discussed above does not arise (a quick check follows below).
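A quick check of point 5 on a given host (standard commands; the exact zone list depends on the kernel configuration): list the zones and look for a HighTotal entry, which only exists on highmem-enabled 32-bit kernels.

# list the memory zones present on this machine
grep -E '^Node [0-9]+, zone' /proc/zoneinfo
# HighTotal only appears on CONFIG_HIGHMEM kernels
grep -i hightotal /proc/meminfo || echo "no HighTotal entry, i.e. no highmem"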

Linux physical memory management

The next question: how does Linux implement the highmem concept?

Linux manages physical memory in three levels: nodes (Node), zones (Zone), and pages (Page).

Under NUMA, the system splits physical memory into nodes according to the number of physical CPU packages. This is not expanded on here either; see:

NUMA (Non-Uniform Memory Access): An Overview

Each node is further divided into several zones. On x86 a node contains ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM; on 64-bit x86 it contains ZONE_DMA (ZONE_DMA32) and ZONE_NORMAL. They form a tree-like containment relationship.

NUMA node information can be viewed with:

$ numactl --hardware    
available: 2 nodes (0-1)    
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22    
node 0 size: 8114 MB    
node 0 free: 2724 MB    
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23    
node 1 size: 8192 MB    
node 1 free: 818 MB    
node distances:    
node   0   1    
  0:  10  20    
  1:  20  10    

Zone information can be viewed with the following command; note that the unit is pages (4 KB):

$ cat /proc/zoneinfo    
Node 0, zone      DMA    
  pages free     3933    
        min      20    
        low      25    
        high     30    
        scanned  0    
        spanned  4095    
        present  3834    

Combining this with the earlier notes on highmem: on an x86 system, physical memory is divided as follows:

Type            Address range
ZONE_DMA        First 16 MB
ZONE_NORMAL     16 MB – 896 MB
ZONE_HIGHMEM    Above 896 MB

ZONE_DMA sits at the low end of memory and serves some legacy ISA devices. ZONE_NORMAL is mapped directly into the upper part of the Linux kernel's linear address space.

On an x86_64 system:

Type            Address range
ZONE_DMA        First 16 MB
ZONE_DMA32      16 MB – 4 GB
ZONE_NORMAL     Above 4 GB

Compared with x86, x86_64 gains ZONE_DMA32 and loses ZONE_HIGHMEM.

How Linux allocates memory

Again, not expanded on in detail here; two recommended articles:

Glibc內存管理–ptmalloc2源代碼分析 (Glibc memory management: a ptmalloc2 source-code analysis)

PageAllocation

Conclusions:

1. malloc is a glibc library function; what it hands out are virtual addresses.

2. When malloc allocates on Linux, requests smaller than MMAP_THRESHOLD (128 KB by default) are served via brk, larger ones via mmap (see the strace sketch after this list).

3. For address space obtained via brk, when the free memory at the top of the heap exceeds M_TRIM_THRESHOLD (128 KB by default) the heap is trimmed; this again concerns virtual addresses.

4. Physical memory is allocated only when the memory is actually read or written, through a page fault.

5. When allocating physical memory, the kernel searches from the highest zone down to the lowest for enough free pages and maps them to the virtual addresses.

6. For a detailed account of how the system allocates memory, see:

Memory Mapping and DMA
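As a rough illustration of point 2 (my own sketch, not taken from the referenced articles), you can watch glibc switch between brk and mmap by tracing a small allocation and one above MMAP_THRESHOLD. python is used only as a convenient allocator here, and the interpreter performs many mmap calls of its own, so read the tail of the trace with that in mind.

# small allocation (well under 128K): the heap is typically extended via brk
strace -e trace=brk,mmap,munmap python -c 'b = bytearray(16 * 1024)' 2>&1 | grep -E 'brk|mmap' | tail
# large allocation (over 128K): typically served by an anonymous mmap
strace -e trace=brk,mmap,munmap python -c 'b = bytearray(512 * 1024)' 2>&1 | grep -E 'brk|mmap' | tail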

lowmem_reserve_ratio

This section is mostly a paraphrase of vm.txt; reading the original is recommended.

Why adjust lowmem_reserve_ratio

On machines with highmem, it is dangerous for the kernel to hand lowmem to user-space processes, because that memory can then be pinned via the mlock() system call or by the unavailability of swap space.

On machines with lots of highmem, a shortage of reclaimable lowmem can be fatal. So the Linux page allocator will not use lowmem for allocations that could be satisfied from highmem. In other words, the kernel defends a certain amount of lowmem from being captured as pinned user memory; the lowmem_reserve_ratio parameter tunes how aggressively the lower zones are defended.

What the lowmem_reserve_ratio parameter means

lowmem_reserve_ratio is an array; it can be read with:

% cat /proc/sys/vm/lowmem_reserve_ratio    
256     256     32    

The array has one element fewer than the number of zones. Each element is not an absolute amount but a ratio, e.g. 1/256 or 1/32.

Pulling up zoneinfo again, using ZONE_DMA and ZONE_DMA32 as examples:

$ cat /proc/zoneinfo    
Node 0, zone      DMA    
  pages free     3933    
        min      20    
        low      25    
        high     30    
        scanned  0    
        spanned  4095    
        present  3834    
        protection: (0, 3179, 7976, 7976)    
Node 0, zone    DMA32    
  pages free     639908    
        min      4456    
        low      5570    
        high     6684    
        scanned  0    
        spanned  1044480    
        present  813848    
        protection: (0, 0, 4797, 4797)    
……………………    

When Linux tries to allocate a page from a zone, it compares that zone's free pages against values derived from the higher zones' present pages.

For example, when an allocation that could have used DMA32 falls back to the DMA zone, a protection value is computed:

protection[dma][dma32] = zone_dma32.present / lowmem_reserve_ratio[dma] (the first element, 256) = 813848 / 256 = 3179

which matches the second element of the protection array in the DMA section above.

The kernel then compares zone_dma.free (3933) with protection[dma][dma32] + zone_dma.watermark[high]:

if free > protection + watermark[high], the page may be allocated from this zone; otherwise the kernel moves on to the next lower zone.

In other words, a lower zone is only tried when the higher zones are short of memory.
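Here is a small diagnostic sketch of that check (my own; the field positions follow the zoneinfo layout shown above, so adjust if your kernel prints it differently). For every zone on node 0 it prints the free page count next to watermark[high] plus the protection entry for the highest zone, which is roughly the bar an allocation falling back from the highest zone has to clear.

awk '
/^Node 0, zone/ { zone = $4 }
/pages free/    { free[zone] = $3 }
$1 == "high"    { high[zone] = $2 }   # the watermark line; per-cpu lines read "high:" and do not match
/protection:/   {
    gsub(/[(),]/, "")                 # strip the punctuation around the array
    prot[zone] = $NF                  # last element: protection against the highest zone
}
END {
    for (z in free)
        printf "%-8s free=%-10d needs > %d (watermark high %d + protection %d)\n",
               z, free[z], high[z] + prot[z], high[z], prot[z]
}' /proc/zoneinfo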

More detailed documentation:

Documentation/sysctl/vm.txt

From the formula:

the larger lowmem_reserve_ratio is, the less memory is protected in the lower zones;

the smaller lowmem_reserve_ratio is, the more memory is protected in the lower zones;

lowmem_reserve_ratio = 1 (i.e. 100%) gives the lower zones maximum protection (demonstrated below).
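A minimal way to see the inverse relationship directly (run as root, and restore your distribution's default value afterwards): lower the first two ratios to 1 and the DMA and DMA32 protection entries should jump to roughly the full size of the zones above them.

cat /proc/sys/vm/lowmem_reserve_ratio
grep protection /proc/zoneinfo                # protection arrays with the current ratios
sysctl -w vm.lowmem_reserve_ratio='1 1 32'    # fully protect DMA and DMA32
grep protection /proc/zoneinfo                # the DMA/DMA32 entries grow accordingly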

lowmem_reserve_ratio (the original text from Documentation/sysctl/vm.txt)

For some specialised workloads on highmem machines it is dangerous for    
the kernel to allow process memory to be allocated from the "lowmem"    
zone.  This is because that memory could then be pinned via the mlock()    
system call, or by unavailability of swapspace.    
    
And on large highmem machines this lack of reclaimable lowmem memory    
can be fatal.    
    
So the Linux page allocator has a mechanism which prevents allocations    
which _could_ use highmem from using too much lowmem.  This means that    
a certain amount of lowmem is defended from the possibility of being    
captured into pinned user memory.    
    
(The same argument applies to the old 16 megabyte ISA DMA region.  This    
mechanism will also defend that region from allocations which could use    
highmem or lowmem).    
    
The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is    
in defending these lower zones.    
    
If you have a machine which uses highmem or ISA DMA and your    
applications are using mlock(), or if you are running with no swap then    
you probably should change the lowmem_reserve_ratio setting.    
    
The lowmem_reserve_ratio is an array. You can see them by reading this file.    
-    
% cat /proc/sys/vm/lowmem_reserve_ratio    
256     256     32    
-    
Note: # of this elements is one fewer than number of zones. Because the highest    
      zone's value is not necessary for following calculation.    
    
But, these values are not used directly. The kernel calculates # of protection    
pages for each zones from them. These are shown as array of protection pages    
in /proc/zoneinfo like followings. (This is an example of x86-64 box).    
Each zone has an array of protection pages like this.    
    
-    
Node 0, zone      DMA    
  pages free     1355    
        min      3    
        low      3    
        high     4    
        :    
        :    
    numa_other   0    
        protection: (0, 2004, 2004, 2004)    
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    
  pagesets    
    cpu: 0 pcp: 0    
        :    
-    
These protections are added to score to judge whether this zone should be used    
for page allocation or should be reclaimed.    
    
In this example, if normal pages (index=2) are required to this DMA zone and    
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should    
not be used because pages_free(1355) is smaller than watermark + protection[2]    
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for    
normal page requirement. If requirement is DMA zone(index=0), protection[0]    
(=0) is used.    
    
zone[i]'s protection[j] is calculated by following expression.    
    
(i < j):    
  zone[i]->protection[j]    
  = (total sums of present_pages from zone[i+1] to zone[j] on the node)    
    / lowmem_reserve_ratio[i];    
(i = j):    
   (should not be protected. = 0;    
(i > j):    
   (not necessary, but looks 0)    
    
The default values of lowmem_reserve_ratio[i] are    
    256 (if zone[i] means DMA or DMA32 zone)    
    32  (others).    
As above expression, they are reciprocal number of ratio.    
256 means 1/256. # of protection pages becomes about "0.39%" of total present    
pages of higher zones on the node.    
    
If you would like to protect more pages, smaller values are effective.    
The minimum value is 1 (1/1 -> 100%).    

The lowmem_reserve parameter in older kernels

Excerpted from http://blog.csdn.net/kickxxx/article/details/8835733

The zone structure in the 2.6 kernel has a member lowmem_reserve:

struct zone {    
    /* Fields commonly accessed by the page allocator */    
    
    /* zone watermarks, access with *_wmark_pages(zone) macros */    
    unsigned long watermark[NR_WMARK];    
    
    /*   
     * We don't know if the memory that we're going to allocate will be freeable   
     * or/and it will be released eventually, so to avoid totally wasting several   
     * GB of ram we must reserve some of the lower zone memory (otherwise we risk   
     * to run OOM on the lower zones despite there's tons of freeable ram   
     * on the higher zones). This array is recalculated at runtime if the   
     * sysctl_lowmem_reserve_ratio sysctl changes.   
     */    
    unsigned long       lowmem_reserve[MAX_NR_ZONES];     

When the kernel allocates memory it may walk several zones: it first tries the first zone in the zonelist and, if that fails, falls back to the next, lower zone (lower here refers only to the zone's position in memory; low-address zones are actually the scarcer resource). Imagine a user process that memory-maps and mlock()s highmem. If the HIGHMEM zone cannot satisfy the allocation, the kernel falls back to NORMAL. The problem is that requests originating from highmem can then exhaust the NORMAL zone, and because of mlock() that memory cannot be reclaimed. The end result: NORMAL has no memory left for the kernel's normal allocations, while HIGHMEM holds plenty of reclaimable memory that cannot be put to use.

To handle this case, when the NORMAL zone sees an allocation request coming from HIGHMEM, it can declare via lowmem_reserve: you may use my memory, but lowmem_reserve[NORMAL] must be kept back for my own use.

Likewise, when allocation from NORMAL fails, the kernel tries the DMA zone in the zonelist, and lowmem_reserve[DMA] limits the requests coming from HIGHMEM and NORMAL.

/*   
 * results with 256, 32 in the lowmem_reserve sysctl:   
 *  1G machine -> (16M dma, 800M-16M normal, 1G-800M high)   
 *  1G machine -> (16M dma, 784M normal, 224M high)   
 *  NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA   
 *  HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL   
 *  HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA   
 *   
 * TBD: should special case ZONE_DMA32 machines here - in those we normally   
 * don't need any ZONE_NORMAL reservation   
 */    
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
     256,    
#endif    
#ifdef CONFIG_ZONE_DMA32    
     256,    
#endif    
#ifdef CONFIG_HIGHMEM    
     32,    
#endif    
     32,    
};    

If you do not want the lower zones to be consumed by higher-zone allocations, set the ratio to 1 for the strongest protection.

The calculation does look odd, though: for allocations coming from NORMAL, lowmem_reserve[DMA] = normal_size / ratio. It uses the NORMAL zone's size rather than the DMA zone's size, something the quoted author says he never fully worked out.

In addition, Documentation/sysctl/vm.txt in newer kernel source trees describes lowmem_reserve very precisely.

Greenplum's officially recommended OS parameter settings

The sysctl.conf parameters listed in this topic are for performance, optimization, and consistency in a
wide variety of environments. Change these settings according to your specific situation and setup.
Set the parameters in the /etc/sysctl.conf file and reload with sysctl -p:
# kernel.shmall = _PHYS_PAGES / 2 # See Shared Memory Pages
kernel.shmall = 4000000000
# kernel.shmmax = kernel.shmall * PAGE_SIZE
kernel.shmmax = 500000000
kernel.shmmni = 4096
vm.overcommit_memory = 2 # See Segment Host Memory
vm.overcommit_ratio = 95 # See Segment Host Memory
net.ipv4.ip_local_port_range = 10000 65535 # See Port Settings
kernel.sem = 500 2048000 200 40960
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
vm.swappiness = 10

vm.zone_reclaim_mode = 0

vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_background_ratio = 0 # See System Memory
vm.dirty_ratio = 0
vm.dirty_background_bytes = 1610612736
vm.dirty_bytes = 4294967296
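The two commented formulas at the top of that block can be evaluated directly. A minimal sketch (the result differs per host, so treat it as a starting point and follow the install guide for your Greenplum version):

# kernel.shmall = _PHYS_PAGES / 2, kernel.shmmax = kernel.shmall * PAGE_SIZE
SHMALL=$(( $(getconf _PHYS_PAGES) / 2 ))
SHMMAX=$(( SHMALL * $(getconf PAGE_SIZE) ))
printf 'kernel.shmall = %s\nkernel.shmmax = %s\n' "$SHMALL" "$SHMMAX"
# append the values to /etc/sysctl.conf and reload with: sysctl -p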

Note that the official recommendation sets vm.zone_reclaim_mode to 0.

References:

https://developer.aliyun.com/article/228285?scm=20140722.184.2.173
