不同CPU性能大PK

前言

比較Hygon7280、Intel、AMD、鯤鵬920、飛騰2500的性能情況

CPU型號	Hygon 7280	AMD 7H12	AMD 7T83	Intel 8163	鯤鵬920	飛騰2500	倚天710
物理核數	32	32	64	24	48	64	128core
超線程	2	2	2	2
路	2	2	2	2	2	2	1
NUMA Node	8	2	4	2	4	16	2
L1d	32K	32K	32K	32K	64K	32K	64K
L2	512K	512K	512K	1024K	512K	2048K	1024K

AMD 7T83 有8個Die, 每個Die L3大小 32M，L2 大小4MiB, 每個Die上 L1I/L1D 各256KiB，每個Die有8core，2、3代都是帶有獨立 IO Die
倚天710是一路服務器，單芯片2塊對稱的 Die

參與比較的幾款CPU參數

IPC的說明：

IPC: insns per cycle insn/cycles 也就是每個時鐘週期能執行的指令數量，越大程序跑的越快

程序的執行時間 = 指令數/(主頻*IPC) //單核下，多核的話再除以核數

Hygon 7280

Hygon 7280 就是AMD Zen架構，最大IPC能到5.

架構： x86_64
CPU 運行模式： 32-bit, 64-bit
字節序： Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU: 128
在線 CPU 列表： 0-127
每個核的線程數： 2
每個座的核數： 32
座： 2
NUMA 節點： 8
廠商 ID： HygonGenuine
CPU 系列： 24
型號： 1
型號名稱： Hygon C86 7280 32-core Processor
步進： 1
CPU MHz： 2194.586
BogoMIPS： 3999.63
虛擬化： AMD-V
L1d 緩存： 2 MiB
L1i 緩存： 4 MiB
L2 緩存： 32 MiB
L3 緩存： 128 MiB
NUMA 節點0 CPU： 0-7,64-71
NUMA 節點1 CPU： 8-15,72-79
NUMA 節點2 CPU： 16-23,80-87
NUMA 節點3 CPU： 24-31,88-95
NUMA 節點4 CPU： 32-39,96-103
NUMA 節點5 CPU： 40-47,104-111
NUMA 節點6 CPU： 48-55,112-119
NUMA 節點7 CPU： 56-63,120-127

架構說明：

每個CPU有4個Die，每個Die有兩個CCX（2 core-Complexes），每個CCX最多有4core（例如7280/7285）共享一個L3 cache；每個Die有兩個Memory Channel，每個CPU帶有8個Memory Channel，並且每個Memory Channel最多支持2根Memory；

海光7系列架構圖：

曙光H620-G30A 機型硬件結構，CPU是hygon 7280（截圖只截取了Socket0）

AMD EPYC 7T83(NC)

兩路服務器，4 numa node，Z3架構

詳細信息：

#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7T83 64-Core Processor
Stepping: 1
CPU MHz: 2154.005
CPU max MHz: 2550.0000
CPU min MHz: 1500.0000
BogoMIPS: 5090.93
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-31,128-159
NUMA node1 CPU(s): 32-63,160-191
NUMA node2 CPU(s): 64-95,192-223
NUMA node3 CPU(s): 96-127,224-255
 
#cat /sys/devices/system/cpu/cpu{0,1,8,16,30,31,32,128}/cache/index3/shared_cpu_map
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
00000000,00000000,00000000,0000ff00,00000000,00000000,00000000,0000ff00
00000000,00000000,00000000,00ff0000,00000000,00000000,00000000,00ff0000
00000000,00000000,00000000,ff000000,00000000,00000000,00000000,ff000000
00000000,00000000,00000000,ff000000,00000000,00000000,00000000,ff000000
00000000,00000000,000000ff,00000000,00000000,00000000,000000ff,00000000
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
 
#cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map
00000000,00000000,00000000,00000001,00000000,00000000,00000000,00000001

L3是8個物理核，16個超線程共享，相當於單核2MB，一塊CPU有8個L3，總共是256MB

#cat cpu0/cache/index3/shared_cpu_list
0-7,128-135
#cat cpu0/cache/index3/size
32768K
#cat cpu0/cache/index2/shared_cpu_list
0,128
 
#cat /sys/devices/system/cpu/cpu{0,1,8,16,30,31,32,128}/cache/index3/shared_cpu_list
0-7,128-135
0-7,128-135
8-15,136-143
16-23,144-151
24-31,152-159
24-31,152-159
32-39,160-167
0-7,128-135

L1D、L1I各爲 2MiB，單物理核爲32KB

空跑nop的IPC爲6（有點嚇人）

#perf stat ./cpu/test
Performance counter stats for process id '449650':
 
2,574.29 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
0 page-faults # 0.000 K/sec
8,985,622,182 cycles # 3.491 GHz (83.33%)
4,390,929 stalled-cycles-frontend # 0.05% frontend cycles idle (83.34%)
4,387,560,442 stalled-cycles-backend # 48.83% backend cycles idle (83.34%)
53,711,907,863 instructions # 5.98 insn per cycle
# 0.08 stalled cycles per insn (83.34%)
418,902,363 branches # 162.725 M/sec (83.34%)
15,036 branch-misses # 0.00% of all branches (83.32%)
 
2.574347594 seconds time elapsed

sysbench 測試7T83 比7H12 略好，可能是ECS、OS等帶來的差異。

測試環境：4.19.91-011.ali4000.alios7.x86_64，5.7.34-log MySQL Community Server (GPL)

測試核數	AMD EPYC 7H12 2.5G（QPS、IPC）	說明
單核	24363 0.58	CPU跑滿
一對HT	33519 0.40	CPU跑滿
2物理核(0-1)	48423 0.57	CPU跑滿
2物理核(0,32) 跨node	46232 0.55	CPU跑滿
2物理核(0,64) 跨socket	45072 0.52	CPU跑滿
4物理核(0-3)	97759 0.58	CPU跑滿
16物理核(0-15)	367992 0.55	CPU跑滿，sys佔比20%，si 10%
32物理核(0-31)	686998 0.51	CPU跑滿，sys佔比20%, si 12%
64物理核(0-63)	1161079 0.50	CPU跑到95%以上，sys佔比20%, si 12%
64物理核(0-31,64-95)	964441 0.49	socket2上的32核一直比較閒，數據無參考意義
64物理核(0-31,64-95)	1147846 0.48	重啓mysqld，立即綁核，sysbench 在32-63上，導致0-31的CPU只能跑到89%

說明，壓測過程動態通過taskset綁核，所以會有數據殘留其它核的cache問題

跨socket taskset綁核的時候要壓很久任務纔會跨socket遷移過去，也就是剛taskset後CPU是跑不滿的

#numastat -p 437803
 
Per-node process memory usage (in MBs) for PID 437803 (mysqld)
Node 0 Node 1 Node 2
--------------- --------------- ---------------
Huge 0.00 0.00 0.00
Heap 1.15 0.00 5403.27
Stack 0.00 0.00 0.09
Private 1921.60 16.22 10647.66
---------------- --------------- --------------- ---------------
Total 1922.75 16.22 16051.02
 
Node 3 Total
--------------- ---------------
Huge 0.00 0.00
Heap 0.03 5404.45
Stack 0.00 0.09
Private 16.20 12601.68
---------------- --------------- ---------------
Total 16.23 18006.22

AMD EPYC 7H12(ECS)

AMD EPYC 7H12 64-Core（ECS，非物理機），最大IPC能到5.

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
座： 2
NUMA 節點： 2
廠商 ID： AuthenticAMD
CPU 系列： 23
型號： 49
型號名稱： AMD EPYC 7H12 64-Core Processor
步進： 0
CPU MHz： 2595.124
BogoMIPS： 5190.24
虛擬化： AMD-V
超管理器廠商： KVM
虛擬化類型： 完全
L1d 緩存： 32K
L1i 緩存： 32K
L2 緩存： 512K
L3 緩存： 16384K
NUMA 節點0 CPU： 0-31
NUMA 節點1 CPU： 32-63

AMD EPYC 7T83 ECS

[root@bugu88 cpu0]# cd /sys/devices/system/cpu/cpu0
[root@bugu88 cpu0]# cat cache/index0/size
32K
[root@bugu88 cpu0]# cat cache/index1/size
32K
[root@bugu88 cpu0]# cat cache/index2/size
512K
[root@bugu88 cpu0]# cat cache/index3/size
32768K
[root@bugu88 cpu0]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
座： 1
NUMA 節點： 1
廠商 ID： AuthenticAMD
CPU 系列： 25
型號： 1
型號名稱： AMD EPYC 7T83 64-Core Processor
步進： 1
CPU MHz： 2545.218
BogoMIPS： 5090.43
超管理器廠商： KVM
虛擬化類型： 完全
L1d 緩存： 32K
L1i 緩存： 32K
L2 緩存： 512K
L3 緩存： 32768K
NUMA 節點0 CPU： 0-15

stream：

[root@bugu88 lmbench-master]# for i in $(seq 0 15); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 0.68 nanoseconds
STREAM copy bandwidth: 23509.84 MB/sec
STREAM scale latency: 0.69 nanoseconds
STREAM scale bandwidth: 23285.51 MB/sec
STREAM add latency: 0.96 nanoseconds
STREAM add bandwidth: 25043.73 MB/sec
STREAM triad latency: 1.40 nanoseconds
STREAM triad bandwidth: 17121.79 MB/sec
1
STREAM copy latency: 0.68 nanoseconds
STREAM copy bandwidth: 23513.96 MB/sec
STREAM scale latency: 0.68 nanoseconds
STREAM scale bandwidth: 23580.06 MB/sec
STREAM add latency: 0.96 nanoseconds
STREAM add bandwidth: 25049.96 MB/sec
STREAM triad latency: 1.35 nanoseconds
STREAM triad bandwidth: 17741.93 MB/sec

Intel 8163

這次對比測試的Intel 8163 CPU信息如下，最大IPC 是4：

#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping: 4
CPU MHz: 2499.121
CPU max MHz: 3100.0000
CPU min MHz: 1000.0000
BogoMIPS: 4998.90
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 33792K
NUMA node0 CPU(s): 0-95
 
-----8269CY
#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping: 7
CPU MHz: 3200.000
CPU max MHz: 3800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4998.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103

不同 intel 型號的差異

如下圖是8269CY和E5-2682上跑的MySQL在相同業務、相同流量下的差異：

CPU使用率差異(下圖8051C是E5-2682，其它是 8269CY，主頻也有30%的差異)

鯤鵬920

[root@ARM 19:15 /root/lmbench3]
#numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 192832 MB
node 0 free: 146830 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 193533 MB
node 1 free: 175354 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 193533 MB
node 2 free: 175718 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 193532 MB
node 3 free: 183643 MB
node distances:
node 0 1 2 3
0: 10 12 20 22
1: 12 10 22 24
2: 20 22 10 12
3: 22 24 12 10
 
#lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Model: 0
CPU max MHz: 2600.0000
CPU min MHz: 200.0000
BogoMIPS: 200.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 24576K
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

飛騰2500

飛騰2500用nop去跑IPC的話，只能到1，但是跑其它代碼能到2.33

#lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 16
Model: 3
BogoMIPS: 100.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 65536K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
NUMA node8 CPU(s): 64-71
NUMA node9 CPU(s): 72-79
NUMA node10 CPU(s): 80-87
NUMA node11 CPU(s): 88-95
NUMA node12 CPU(s): 96-103
NUMA node13 CPU(s): 104-111
NUMA node14 CPU(s): 112-119
NUMA node15 CPU(s): 120-127
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
 
#perf stat ./nop
failed to read counter stalled-cycles-frontend
failed to read counter stalled-cycles-backend
failed to read counter branches
 
Performance counter stats for './nop':
 
78638.700540 task-clock (msec) # 0.999 CPUs utilized
1479 context-switches # 0.019 K/sec
55 cpu-migrations # 0.001 K/sec
37 page-faults # 0.000 K/sec
165127619524 cycles # 2.100 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
165269372437 instructions # 1.00 insns per cycle
<not supported> branches
3057191 branch-misses # 0.00% of all branches
 
78.692839007 seconds time elapsed
 
#dmidecode -t processor
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
# SMBIOS implementations newer than version 3.0 are not
# fully supported by this version of dmidecode.
 
Handle 0x0004, DMI type 4, 48 bytes
Processor Information
Socket Designation: BGA3576
Type: Central Processor
Family: <OUT OF SPEC>
Manufacturer: PHYTIUM
ID: 00 00 00 00 70 1F 66 22
Version: S2500
Voltage: 0.8 V
External Clock: 50 MHz
Max Speed: 2100 MHz
Current Speed: 2100 MHz
Status: Populated, Enabled
Upgrade: Other
L1 Cache Handle: 0x0005
L2 Cache Handle: 0x0007
L3 Cache Handle: 0x0008
Serial Number: N/A
Asset Tag: No Asset Tag
Part Number: NULL
Core Count: 64
Core Enabled: 64
Thread Count: 64
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Power/Performance Control

其它

2Die，2node

#lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 128
Socket(s): 1
NUMA node(s): 2
Model: 0
BogoMIPS: 100.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 65536K //64core share
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh
 
#cat cpu{0,1,8,16,30,31,32,127}/cache/index3/shared_cpu_list
0-63
0-63
0-63
0-63
0-63
0-63
0-63
64-127
 
#grep -E "core|64.000" lat.log
core:0
64.00000 59.653
core:8
64.00000 62.265
core:16
64.00000 59.411
core:24
64.00000 55.836
core:32
64.00000 55.909
core:40
64.00000 56.176
core:48
64.00000 57.240
core:56
64.00000 59.485
core:64
64.00000 131.818
core:72
64.00000 127.182
core:80
64.00000 122.452
core:88
64.00000 121.673
core:96
64.00000 126.533
core:104
64.00000 125.673
core:112
64.00000 124.188
core:120
64.00000 130.202
 
#numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 515652 MB
node 0 free: 514913 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 516086 MB
node 1 free: 514815 MB
node distances:
node 0 1
0: 10 15
1: 15 10

單核以及HT計算Prime性能比較

以上兩款CPU但從物理上的指標來看似乎AMD要好很多，從工藝上AMD也要領先一代(2年），從單核參數上來說是2.0 VS 2.5GHz，但是IPC 是5 VS 4，算下來理想的單核性能剛好一致（25=2.5 4）。

從外面的一些跑分結果顯示也是AMD 要好，但是實際性能怎麼樣呢？

測試命令，這個測試命令無論在哪個CPU下，用2個物理核用時都是一個物理核的一半，所以這個計算是可以完全並行的

1	taskset -c 1 /usr/bin/sysbench --num-threads=1 --test=cpu --cpu-max-prime=50000 run //單核用一個threads，綁核; HT用2個threads，綁一對HT

測試結果爲耗時，單位秒

測試項	AMD EPYC 7H12 2.5G CentOS 7.9	Hygon 7280 2.1GHz CentOS	Hygon 7280 2.1GHz 麒麟	Intel 8269 2.50G	Intel 8163 CPU @ 2.50GHz	Intel E5-2682 v4 @ 2.50GHz
單核 prime 50000 耗時	59秒 IPC 0.56	77秒 IPC 0.55	89秒 IPC 0.56;	83 0.41	105秒 IPC 0.41	109秒 IPC 0.39
HT prime 50000 耗時	57秒 IPC 0.31	74秒 IPC 0.29	87秒 IPC 0.29	48 0.35	60秒 IPC 0.36	74秒 IPC 0.29

相同CPU下的指令數基本= 耗時 IPC 核數

以上測試結果顯示Hygon 7280單核計算能力是要強過Intel 8163的，但是超線程在這個場景下太不給力，相當於沒有。

當然上面的計算Prime太單純了，代表不了複雜的業務場景，所以接下來用MySQL的查詢場景來看看。

如果是arm芯片在計算prime上明顯要好過x86，猜測是除法取餘指令上有優化

1 2	#taskset -c 11 sysbench cpu --threads=1 --events=50000 run sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

測試結果爲10秒鐘的event

測試項	FT2500 2.1G	鯤鵬920-4826 2.6GHz	Intel 8163 CPU @ 2.50GHz	Hygon C86 7280 2.1GHz	AMD 7T83
單核 prime 10秒 events	21626 IPC 0.89	30299 IPC 1.01	8435 IPC 0.41	10349 IPC 0.63	40112 IPC 1.38

對比MySQL sysbench和tpcc性能

分別將MySQL 5.7.34社區版部署到intel+AliOS以及hygon 7280+CentOS上，將mysqld綁定到單核，一樣的壓力配置均將CPU跑到100%，然後用sysbench測試點查， HT表示將mysqld綁定到一對HT核。

sysbench點查

測試命令類似如下：

sysbench --test='/usr/share/doc/sysbench/tests/db/select.lua' --oltp_tables_count=1 --report-interval=1 --oltp-table-size=10000000 --mysql-port=3307 --mysql-db=sysbench_single --mysql-user=root --mysql-password='Bj6f9g96!@#' --max-requests=0 --oltp_skip_trx=on --oltp_auto_inc=on --oltp_range_size=5 --mysql-table-engine=innodb --rand-init=on --max-time=300 --mysql-host=x86.51 --num-threads=4 run

測試結果(測試中的差異AMD、Hygon CPU跑在CentOS7.9， intel CPU、Kunpeng 920 跑在AliOS上, xdb表示用集團的xdb替換社區的MySQL Server，麒麟是國產OS)：

測試核數	AMD EPYC 7H12 2.5G	Hygon 7280 2.1G	Hygon 7280 2.1GHz 麒麟	Intel 8269 2.50G	Intel 8163 2.50G	Intel 8163 2.50G XDB5.7	鯤鵬 920-4826 2.6G	鯤鵬 920-4826 2.6G XDB8.0	FT2500 alisql 8.0 本地–socket
單核	24674 0.54	13441 0.46	10236 0.39	28208 0.75	25474 0.84	29376 0.89	9694 0.49	8301 0.46	3602 0.53
一對HT	36157 0.42	21747 0.38	19417 0.37	36754 0.49	35894 0.6	40601 0.65	無HT	無HT	無HT
4物理核	94132 0.52	49822 0.46	38033 0.37	90434 0.69 350%	87254 0.73	106472 0.83	34686 0.42	28407 0.39	14232 0.53
16物理核	325409 0.48	171630 0.38	134980 0.34	371718 0.69 1500%	332967 0.72	446290 0.85 //16核比4核好！	116122 0.35	94697 0.33	59199 0.6 8core:31210 0.59
32物理核	542192 0.43	298716 0.37	255586 0.33	642548 0.64 2700%	588318 0.67	598637 0.81 CPU 2400%	228601 0.36	177424 0.32	114020 0.65

麒麟OS下CPU很難跑滿，大致能跑到90%-95%左右，麒麟上裝的社區版MySQL-5.7.29；飛騰要特別注意mysqld所在socket，同時以上飛騰數據都是走–sock壓測所得，32core走網絡壓測QPS爲：99496（15%的網絡損耗）[^說明]

Mysqld 二進制代碼所在 page cache帶來的性能影響

如果是飛騰跨socket影響很大，mysqld二進制跨socket性能會下降30%以上

對於鯤鵬920，雙路服務器上測試，mysqld綁在node0, 但是分別將mysqld二進制load進不同的node上的page cache，然後執行點查

mysqld	node0	node1	node2	node3
QPS	190120 IPC 0.40	182518 IPC 0.39	189046 IPC 0.40	186533 IPC 0.40

以上數據可以看出這裏node0到node1還是很慢的，居然比跨socket還慢，反過來說鯤鵬跨socket性能很好

綁定mysqld到不同node的page cache操作

#systemctl stop mysql-server
 
[root@poc65 /root/vmtouch]
#vmtouch -e /usr/local/mysql/bin/mysqld
Files: 1
Directories: 0
Evicted Pages: 5916 (23M)
Elapsed: 0.00322 seconds
 
#vmtouch -v /usr/local/mysql/bin/mysqld
/usr/local/mysql/bin/mysqld
[ ] 0/5916
 
Files: 1
Directories: 0
Resident Pages: 0/5916 0/23M 0%
Elapsed: 0.000204 seconds
 
#taskset -c 24 md5sum /usr/local/mysql/bin/mysqld
 
#grep mysqld /proc/`pidof mysqld`/numa_maps //檢查mysqld具體綁定在哪個node上
00400000 default file=/usr/local/mysql/bin/mysqld mapped=3392 active=1 N0=3392 kernelpagesize_kB=4
0199b000 default file=/usr/local/mysql/bin/mysqld anon=10 dirty=10 mapped=134 active=10 N0=134 kernelpagesize_kB=4
01a70000 default file=/usr/local/mysql/bin/mysqld anon=43 dirty=43 mapped=120 active=43 N0=120 kernelpagesize_kB=4

網卡以及node距離帶來的性能差異

在鯤鵬920+mysql5.7+alios，將內存分配鎖在node0上，然後分別綁核在1、24、48、72core，進行sysbench點查對比

	Core1	Core24	Core48	Core72
QPS	10800	10400	7700	7700

以上測試的時候業務進程分配的內存全限制在node0上（下面的網卡中斷測試也是同樣內存結構）

#/root/numa-maps-summary.pl </proc/123853/numa_maps
N0 : 5085548 ( 19.40 GB)
N1 : 4479 ( 0.02 GB)
N2 : 1 ( 0.00 GB)
active : 0 ( 0.00 GB)
anon : 5085455 ( 19.40 GB)
dirty : 5085455 ( 19.40 GB)
kernelpagesize_kB: 2176 ( 0.01 GB)
mapmax : 348 ( 0.00 GB)
mapped : 4626 ( 0.02 GB)

對比測試，將內存鎖在node3上，重複進行以上測試結果如下：

	Core1	Core24	Core48	Core72
QPS	10500	10000	8100	8000

#/root/numa-maps-summary.pl </proc/54478/numa_maps
N0 : 16 ( 0.00 GB)
N1 : 4401 ( 0.02 GB)
N2 : 1 ( 0.00 GB)
N3 : 1779989 ( 6.79 GB)
active : 0 ( 0.00 GB)
anon : 1779912 ( 6.79 GB)
dirty : 1779912 ( 6.79 GB)
kernelpagesize_kB: 1108 ( 0.00 GB)
mapmax : 334 ( 0.00 GB)
mapped : 4548 ( 0.02 GB)

機器上網卡eth1插在node0上，由以上兩組對比測試發現網卡影響比內存跨node影響更大，網卡影響有20%。而內存的影響基本看不到（就近好那麼一點點，但是不明顯，只能解釋爲cache命中率很高了）

此時軟中斷都在node0上，如果將軟中斷綁定到node3上，第72core的QPS能提升到8500，並且非常穩定。同時core0的QPS下降到10000附近。

網卡軟中斷以及網卡遠近的測試結論

測試機器只是用了一塊網卡，網卡插在node0上。

一般網卡中斷會佔用一些CPU，如果把網卡中斷挪到其它node的core上，在鯤鵬920上測試，業務跑在node3（使用全部24core），網卡中斷分別在node0和node3，QPS分別是：179000 VS 175000 （此時把中斷放到node0或者是和node3最近的node2上差別不大）

如果將業務跑在node0上（全部24core），網卡中斷分別在node0和node1上得到的QPS分別是：204000 VS 212000

tpcc 1000倉

測試結果(測試中Hygon 7280分別跑在CentOS7.9和麒麟上，鯤鵬/intel CPU 跑在AliOS、麒麟是國產OS)：

tpcc測試數據，結果爲1000倉，tpmC (NewOrders) ，未標註CPU 則爲跑滿了

測試核數	Intel 8269 2.50G	Intel 8163 2.50G	Hygon 7280 2.1GHz 麒麟	Hygon 7280 2.1G CentOS 7.9	鯤鵬 920-4826 2.6G	鯤鵬 920-4826 2.6G XDB8.0
1物理核	12392	9902	4706	7011	6619	4653
一對HT	17892	15324	8950	11778	無HT	無HT
4物理核	51525	40877	19387 380%	30046	23959	20101
8物理核	100792	81799	39664 750%	60086	42368	40572
16物理核	160798 抖動	140488 CPU抖動	75013 1400%	106419 1300-1550%	70581 1200%	79844
24物理核	188051	164757 1600-2100%	100841 1800-2000%	130815 1600-2100%	88204 1600%	115355
32物理核	195292	185171 2000-2500%	116071 1900-2400%	142746 1800-2400%	102089 1900%	143567
48物理核	19969l	195730 2100-2600%	128188 2100-2800%	149782 2000-2700%	116374 2500%	206055 4500%

tpcc併發到一定程度後主要是鎖導致性能上不去，所以超多核意義不大。

如果在Hygon 7280 2.1GHz 麒麟上起兩個MySQLD實例，每個實例各綁定32物理core，性能剛好翻倍：

測試過程CPU均跑滿（未跑滿的話會標註出來），IPC跑不起來性能就必然低，超線程雖然總性能好了但是會導致IPC降低(參考前面的公式)。可以看到對本來IPC比較低的場景，啓用超線程後一般對性能會提升更大一些。

CPU核數增加到32核後，MySQL社區版性能追平xdb，此時sysbench使用120線程壓性能較好（AMD得240線程壓）

32核的時候對比下MySQL 社區版在Hygon7280和Intel 8163下的表現：

三款CPU的性能指標

測試項	AMD EPYC 7H12 2.5G	Hygon 7280 2.1GHz	Intel 8163 CPU @ 2.50GHz
內存帶寬(MiB/s)	12190.50	6206.06	7474.45
內存延時(遍歷很大一個數組)	0.334ms	0.336ms	0.429ms

在lmbench上的測試數據

stream主要用於測試帶寬，對應的時延是在帶寬跑滿情況下的帶寬。

lat_mem_rd用來測試操作不同數據大小的時延。總的來說帶寬看stream、時延看lat_mem_rd

飛騰2500

用stream測試帶寬和latency，可以看到帶寬隨着numa距離不斷減少、對應的latency不斷增加，到最近的numa node有10%的損耗，這個損耗和numactl給出的距離完全一致。跨socket訪問內存latency是node內的3倍，帶寬是三分之一，但是socket1性能和socket0性能完全一致

time for i in $(seq 7 8 128); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
 
#numactl -C 7 -m 0 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 2.84 nanoseconds
STREAM copy bandwidth: 5638.21 MB/sec
STREAM scale latency: 2.72 nanoseconds
STREAM scale bandwidth: 5885.97 MB/sec
STREAM add latency: 2.26 nanoseconds
STREAM add bandwidth: 10615.13 MB/sec
STREAM triad latency: 4.53 nanoseconds
STREAM triad bandwidth: 5297.93 MB/sec
 
#numactl -C 7 -m 1 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.16 nanoseconds
STREAM copy bandwidth: 5058.71 MB/sec
STREAM scale latency: 3.15 nanoseconds
STREAM scale bandwidth: 5074.78 MB/sec
STREAM add latency: 2.35 nanoseconds
STREAM add bandwidth: 10197.36 MB/sec
STREAM triad latency: 5.12 nanoseconds
STREAM triad bandwidth: 4686.37 MB/sec
 
#numactl -C 7 -m 2 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.85 nanoseconds
STREAM copy bandwidth: 4150.98 MB/sec
STREAM scale latency: 3.95 nanoseconds
STREAM scale bandwidth: 4054.30 MB/sec
STREAM add latency: 2.64 nanoseconds
STREAM add bandwidth: 9100.12 MB/sec
STREAM triad latency: 6.39 nanoseconds
STREAM triad bandwidth: 3757.70 MB/sec
 
#numactl -C 7 -m 3 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.69 nanoseconds
STREAM copy bandwidth: 4340.24 MB/sec
STREAM scale latency: 3.62 nanoseconds
STREAM scale bandwidth: 4422.18 MB/sec
STREAM add latency: 2.47 nanoseconds
STREAM add bandwidth: 9704.82 MB/sec
STREAM triad latency: 5.74 nanoseconds
STREAM triad bandwidth: 4177.85 MB/sec
 
[[email protected] /root/lmbench3]
#numactl -C 7 -m 7 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.95 nanoseconds
STREAM copy bandwidth: 4051.51 MB/sec
STREAM scale latency: 3.94 nanoseconds
STREAM scale bandwidth: 4060.63 MB/sec
STREAM add latency: 2.54 nanoseconds
STREAM add bandwidth: 9434.51 MB/sec
STREAM triad latency: 6.13 nanoseconds
STREAM triad bandwidth: 3913.36 MB/sec
 
[[email protected] /root/lmbench3]
#numactl -C 7 -m 10 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 8.80 nanoseconds
STREAM copy bandwidth: 1817.78 MB/sec
STREAM scale latency: 8.59 nanoseconds
STREAM scale bandwidth: 1861.65 MB/sec
STREAM add latency: 5.55 nanoseconds
STREAM add bandwidth: 4320.68 MB/sec
STREAM triad latency: 13.94 nanoseconds
STREAM triad bandwidth: 1721.76 MB/sec
 
[[email protected] /root/lmbench3]
#numactl -C 7 -m 11 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 9.27 nanoseconds
STREAM copy bandwidth: 1726.52 MB/sec
STREAM scale latency: 9.31 nanoseconds
STREAM scale bandwidth: 1718.10 MB/sec
STREAM add latency: 5.65 nanoseconds
STREAM add bandwidth: 4250.89 MB/sec
STREAM triad latency: 14.09 nanoseconds
STREAM triad bandwidth: 1703.66 MB/sec
 
[[email protected] /root/lmbench3]
#numactl -C 88 -m 11 ./bin/stream -W 5 -N 5 -M 64M //在另外一個socket上測試本numa，和node0性能完全一致
STREAM copy latency: 2.93 nanoseconds
STREAM copy bandwidth: 5454.67 MB/sec
STREAM scale latency: 2.96 nanoseconds
STREAM scale bandwidth: 5400.03 MB/sec
STREAM add latency: 2.28 nanoseconds
STREAM add bandwidth: 10543.42 MB/sec
STREAM triad latency: 4.52 nanoseconds
STREAM triad bandwidth: 5308.40 MB/sec
 
[[email protected] /root/lmbench3]
#numactl -C 7 -m 15 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 8.73 nanoseconds
STREAM copy bandwidth: 1831.77 MB/sec
STREAM scale latency: 8.81 nanoseconds
STREAM scale bandwidth: 1815.13 MB/sec
STREAM add latency: 5.63 nanoseconds
STREAM add bandwidth: 4265.21 MB/sec
STREAM triad latency: 13.09 nanoseconds
STREAM triad bandwidth: 1833.68 MB/sec

Lat_mem_rd 用cpu7訪問node0和node15對比結果，隨着數據的加大，延時在加大，64M時能有3倍差距，和上面測試一致

下圖第一列表示讀寫數據的大小（單位M），第二列表示訪問延時（單位納秒），一般可以看到在L1/L2/L3 cache大小的地方延時會有跳躍，遠超過L3大小後，延時就是內存延時了

1	numactl -C 7 -m 0 ./bin/lat_mem_rd -W 5 -N 5 -t 64M //-C 7 cpu 7, -m 0 node0, -W 熱身 -t stride

同樣的機型，開關numa的測試結果，關numa 時延、帶寬都差了幾倍

關閉numa的機器上測試結果隨機性很強，這應該是和內存分配在那裏有關係，不過如果機器一直保持這個狀態反覆測試的話，快的core一直快，慢的core一直慢，這是因爲物理地址分配有一定的規律，在物理內存沒怎麼變化的情況下，快的core恰好分到的內存比較近。

同時不同機器狀態（內存使用率）測試結果也不一樣

鯤鵬920

#for i in $(seq 0 15); do echo core:$i; numactl -N $i -m 7 ./bin/stream -W 5 -N 5 -M 64M; done
STREAM copy latency: 1.84 nanoseconds
STREAM copy bandwidth: 8700.75 MB/sec
STREAM scale latency: 1.86 nanoseconds
STREAM scale bandwidth: 8623.60 MB/sec
STREAM add latency: 2.18 nanoseconds
STREAM add bandwidth: 10987.04 MB/sec
STREAM triad latency: 3.03 nanoseconds
STREAM triad bandwidth: 7926.87 MB/sec
 
#numactl -C 7 -m 1 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 2.05 nanoseconds
STREAM copy bandwidth: 7802.45 MB/sec
STREAM scale latency: 2.08 nanoseconds
STREAM scale bandwidth: 7681.87 MB/sec
STREAM add latency: 2.19 nanoseconds
STREAM add bandwidth: 10954.76 MB/sec
STREAM triad latency: 3.17 nanoseconds
STREAM triad bandwidth: 7559.86 MB/sec
 
#numactl -C 7 -m 2 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.51 nanoseconds
STREAM copy bandwidth: 4556.86 MB/sec
STREAM scale latency: 3.58 nanoseconds
STREAM scale bandwidth: 4463.66 MB/sec
STREAM add latency: 2.71 nanoseconds
STREAM add bandwidth: 8869.79 MB/sec
STREAM triad latency: 5.92 nanoseconds
STREAM triad bandwidth: 4057.12 MB/sec
 
#numactl -C 7 -m 3 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 3.94 nanoseconds
STREAM copy bandwidth: 4064.25 MB/sec
STREAM scale latency: 3.82 nanoseconds
STREAM scale bandwidth: 4188.67 MB/sec
STREAM add latency: 2.86 nanoseconds
STREAM add bandwidth: 8390.70 MB/sec
STREAM triad latency: 4.78 nanoseconds
STREAM triad bandwidth: 5024.25 MB/sec
 
#numactl -C 24 -m 3 ./bin/stream -W 5 -N 5 -M 64M
STREAM copy latency: 4.10 nanoseconds
STREAM copy bandwidth: 3904.63 MB/sec
STREAM scale latency: 4.03 nanoseconds
STREAM scale bandwidth: 3969.41 MB/sec
STREAM add latency: 3.07 nanoseconds
STREAM add bandwidth: 7816.08 MB/sec
STREAM triad latency: 5.06 nanoseconds
STREAM triad bandwidth: 4738.66 MB/sec

海光7280

可以看到跨numa（一個numa也就是一個socket，等同於跨socket）RT從1.5上升到2.5，這個數據比鯤鵬920要好很多

[root@hygon8 14:32 /root/lmbench-master]
#lscpu
架構： x86_64
CPU 運行模式： 32-bit, 64-bit
字節序： Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU: 128
在線 CPU 列表： 0-127
每個核的線程數： 2
每個座的核數： 32
座： 2
NUMA 節點： 8
廠商 ID： HygonGenuine
CPU 系列： 24
型號： 1
型號名稱： Hygon C86 7280 32-core Processor
步進： 1
CPU MHz： 2194.586
BogoMIPS： 3999.63
虛擬化： AMD-V
L1d 緩存： 2 MiB
L1i 緩存： 4 MiB
L2 緩存： 32 MiB
L3 緩存： 128 MiB
NUMA 節點0 CPU： 0-7,64-71
NUMA 節點1 CPU： 8-15,72-79
NUMA 節點2 CPU： 16-23,80-87
NUMA 節點3 CPU： 24-31,88-95
NUMA 節點4 CPU： 32-39,96-103
NUMA 節點5 CPU： 40-47,104-111
NUMA 節點6 CPU： 48-55,112-119
NUMA 節點7 CPU： 56-63,120-127
 
//可以看到7號core比15、23、31號core明顯要快，就近訪問node 0的內存，跨numa node（跨Die）沒有內存交織分配
[root@hygon8 14:32 /root/lmbench-master]
#time for i in $(seq 7 8 64); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
7
STREAM copy latency: 1.38 nanoseconds
STREAM copy bandwidth: 11559.53 MB/sec
STREAM scale latency: 1.16 nanoseconds
STREAM scale bandwidth: 13815.87 MB/sec
STREAM add latency: 1.40 nanoseconds
STREAM add bandwidth: 17145.85 MB/sec
STREAM triad latency: 1.44 nanoseconds
STREAM triad bandwidth: 16637.18 MB/sec
15
STREAM copy latency: 1.67 nanoseconds
STREAM copy bandwidth: 9591.77 MB/sec
STREAM scale latency: 1.56 nanoseconds
STREAM scale bandwidth: 10242.50 MB/sec
STREAM add latency: 1.45 nanoseconds
STREAM add bandwidth: 16581.00 MB/sec
STREAM triad latency: 2.00 nanoseconds
STREAM triad bandwidth: 12028.83 MB/sec
23
STREAM copy latency: 1.65 nanoseconds
STREAM copy bandwidth: 9701.49 MB/sec
STREAM scale latency: 1.53 nanoseconds
STREAM scale bandwidth: 10427.98 MB/sec
STREAM add latency: 1.42 nanoseconds
STREAM add bandwidth: 16846.10 MB/sec
STREAM triad latency: 1.97 nanoseconds
STREAM triad bandwidth: 12189.72 MB/sec
31
STREAM copy latency: 1.64 nanoseconds
STREAM copy bandwidth: 9742.86 MB/sec
STREAM scale latency: 1.52 nanoseconds
STREAM scale bandwidth: 10510.80 MB/sec
STREAM add latency: 1.45 nanoseconds
STREAM add bandwidth: 16559.86 MB/sec
STREAM triad latency: 1.92 nanoseconds
STREAM triad bandwidth: 12490.01 MB/sec
39
STREAM copy latency: 2.55 nanoseconds
STREAM copy bandwidth: 6286.25 MB/sec
STREAM scale latency: 2.51 nanoseconds
STREAM scale bandwidth: 6383.11 MB/sec
STREAM add latency: 1.76 nanoseconds
STREAM add bandwidth: 13660.83 MB/sec
STREAM triad latency: 3.68 nanoseconds
STREAM triad bandwidth: 6523.02 MB/sec

如果這種芯片在bios裏設置Die interleaving，4塊die當成一個numa node吐出來給OS

#lscpu
架構： x86_64
CPU 運行模式： 32-bit, 64-bit
字節序： Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU: 128
在線 CPU 列表： 0-127
每個核的線程數： 2
每個座的核數： 32
座： 2
NUMA 節點： 2
廠商 ID： HygonGenuine
CPU 系列： 24
型號： 1
型號名稱： Hygon C86 7280 32-core Processor
步進： 1
CPU MHz： 2108.234
BogoMIPS： 3999.45
虛擬化： AMD-V
L1d 緩存： 2 MiB
L1i 緩存： 4 MiB
L2 緩存： 32 MiB
L3 緩存： 128 MiB
//注意這裏和真實物理架構不一致，bios配置了Die Interleaving Enable
//表示每路內多個Die內存交織分配，這樣整個一路就是一個大Die
NUMA 節點0 CPU： 0-31,64-95
NUMA 節點1 CPU： 32-63,96-127
 
 
//enable die interleaving 後繼續streaming測試
//最終測試結果表現就是7/15/23/31 core性能一致，因爲默認一個numa內內存交織分配
//可以看到同一路下的四個die內存交織訪問，所以4個node內存延時一樣了（被平均），都不如就近快
[root@hygon3 16:09 /root/lmbench-master]
#time for i in $(seq 7 8 64); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
7
STREAM copy latency: 1.48 nanoseconds
STREAM copy bandwidth: 10782.58 MB/sec
STREAM scale latency: 1.20 nanoseconds
STREAM scale bandwidth: 13364.38 MB/sec
STREAM add latency: 1.46 nanoseconds
STREAM add bandwidth: 16408.32 MB/sec
STREAM triad latency: 1.53 nanoseconds
STREAM triad bandwidth: 15696.00 MB/sec
15
STREAM copy latency: 1.51 nanoseconds
STREAM copy bandwidth: 10601.25 MB/sec
STREAM scale latency: 1.24 nanoseconds
STREAM scale bandwidth: 12855.87 MB/sec
STREAM add latency: 1.46 nanoseconds
STREAM add bandwidth: 16382.42 MB/sec
STREAM triad latency: 1.53 nanoseconds
STREAM triad bandwidth: 15691.48 MB/sec
23
STREAM copy latency: 1.50 nanoseconds
STREAM copy bandwidth: 10700.61 MB/sec
STREAM scale latency: 1.27 nanoseconds
STREAM scale bandwidth: 12634.63 MB/sec
STREAM add latency: 1.47 nanoseconds
STREAM add bandwidth: 16370.67 MB/sec
STREAM triad latency: 1.55 nanoseconds
STREAM triad bandwidth: 15455.75 MB/sec
31
STREAM copy latency: 1.50 nanoseconds
STREAM copy bandwidth: 10637.39 MB/sec
STREAM scale latency: 1.25 nanoseconds
STREAM scale bandwidth: 12778.99 MB/sec
STREAM add latency: 1.46 nanoseconds
STREAM add bandwidth: 16420.65 MB/sec
STREAM triad latency: 1.61 nanoseconds
STREAM triad bandwidth: 14946.80 MB/sec
39
STREAM copy latency: 2.35 nanoseconds
STREAM copy bandwidth: 6807.09 MB/sec
STREAM scale latency: 2.32 nanoseconds
STREAM scale bandwidth: 6906.93 MB/sec
STREAM add latency: 1.63 nanoseconds
STREAM add bandwidth: 14729.23 MB/sec
STREAM triad latency: 3.36 nanoseconds
STREAM triad bandwidth: 7151.67 MB/sec
47
STREAM copy latency: 2.31 nanoseconds
STREAM copy bandwidth: 6938.47 MB/sec

以華爲泰山服務器(鯤鵬920芯片)配置爲例：

Die Interleaving 控制是否使能DIE交織。使能DIE交織能充分利用系統的DDR帶寬，並儘量保證各DDR通道的帶寬均衡，提升DDR的利用率

hygon5280測試數據

-----hygon5280測試數據
[root@localhost lmbench-master]# for i in $(seq 0 8 24); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 1.22 nanoseconds
STREAM copy bandwidth: 13166.34 MB/sec
STREAM scale latency: 1.13 nanoseconds
STREAM scale bandwidth: 14166.95 MB/sec
STREAM add latency: 1.15 nanoseconds
STREAM add bandwidth: 20818.63 MB/sec
STREAM triad latency: 1.39 nanoseconds
STREAM triad bandwidth: 17211.81 MB/sec
8
STREAM copy latency: 1.56 nanoseconds
STREAM copy bandwidth: 10273.07 MB/sec
STREAM scale latency: 1.50 nanoseconds
STREAM scale bandwidth: 10701.89 MB/sec
STREAM add latency: 1.20 nanoseconds
STREAM add bandwidth: 19996.68 MB/sec
STREAM triad latency: 1.93 nanoseconds
STREAM triad bandwidth: 12443.70 MB/sec
16
STREAM copy latency: 2.52 nanoseconds
STREAM copy bandwidth: 6357.71 MB/sec
STREAM scale latency: 2.48 nanoseconds
STREAM scale bandwidth: 6454.95 MB/sec
STREAM add latency: 1.67 nanoseconds
STREAM add bandwidth: 14362.51 MB/sec
STREAM triad latency: 3.65 nanoseconds
STREAM triad bandwidth: 6572.85 MB/sec
24
STREAM copy latency: 2.44 nanoseconds
STREAM copy bandwidth: 6554.24 MB/sec
STREAM scale latency: 2.41 nanoseconds
STREAM scale bandwidth: 6642.80 MB/sec
STREAM add latency: 1.44 nanoseconds
STREAM add bandwidth: 16695.82 MB/sec
STREAM triad latency: 3.61 nanoseconds
STREAM triad bandwidth: 6639.18 MB/sec
[root@localhost lmbench-master]# lscpu
架構： x86_64
CPU 運行模式： 32-bit, 64-bit
字節序： Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU: 64
在線 CPU 列表： 0-63
每個核的線程數： 2
每個座的核數： 16
座： 2
NUMA 節點： 4
廠商 ID： HygonGenuine
CPU 系列： 24
型號： 1
型號名稱： Hygon C86 5280 16-core Processor
步進： 1
Frequency boost: enabled
CPU MHz： 2799.311
CPU 最大 MHz： 2500.0000
CPU 最小 MHz： 1600.0000
BogoMIPS： 4999.36
虛擬化： AMD-V
L1d 緩存： 1 MiB
L1i 緩存： 2 MiB
L2 緩存： 16 MiB
L3 緩存： 64 MiB
NUMA 節點0 CPU： 0-7,32-39
NUMA 節點1 CPU： 8-15,40-47
NUMA 節點2 CPU： 16-23,48-55
NUMA 節點3 CPU： 24-31,56-63
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB
filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
標記： fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse3
6 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdts
cp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm
aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe p
opcnt xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy
abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfct
r_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev
ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt s
ha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt
lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassist
s pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov suc
cor smca

intel 8269CY

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 104
On-line CPU(s) list: 0-103
Thread(s) per core: 2
Core(s) per socket: 26
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping: 7
CPU MHz: 3200.000
CPU max MHz: 3800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4998.89
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-25,52-77
NUMA node1 CPU(s): 26-51,78-103
 
[[email protected] /home/ren/lmbench3]
#time for i in $(seq 0 8 51); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 1.15 nanoseconds
STREAM copy bandwidth: 13941.80 MB/sec
STREAM scale latency: 1.16 nanoseconds
STREAM scale bandwidth: 13799.89 MB/sec
STREAM add latency: 1.31 nanoseconds
STREAM add bandwidth: 18318.23 MB/sec
STREAM triad latency: 1.56 nanoseconds
STREAM triad bandwidth: 15356.72 MB/sec
16
STREAM copy latency: 1.12 nanoseconds
STREAM copy bandwidth: 14293.68 MB/sec
STREAM scale latency: 1.13 nanoseconds
STREAM scale bandwidth: 14162.47 MB/sec
STREAM add latency: 1.31 nanoseconds
STREAM add bandwidth: 18293.27 MB/sec
STREAM triad latency: 1.53 nanoseconds
STREAM triad bandwidth: 15692.47 MB/sec
32
STREAM copy latency: 1.52 nanoseconds
STREAM copy bandwidth: 10551.71 MB/sec
STREAM scale latency: 1.52 nanoseconds
STREAM scale bandwidth: 10508.33 MB/sec
STREAM add latency: 1.38 nanoseconds
STREAM add bandwidth: 17363.22 MB/sec
STREAM triad latency: 2.00 nanoseconds
STREAM triad bandwidth: 12024.52 MB/sec
40
STREAM copy latency: 1.49 nanoseconds
STREAM copy bandwidth: 10758.50 MB/sec
STREAM scale latency: 1.50 nanoseconds
STREAM scale bandwidth: 10680.17 MB/sec
STREAM add latency: 1.34 nanoseconds
STREAM add bandwidth: 17948.34 MB/sec
STREAM triad latency: 1.98 nanoseconds
STREAM triad bandwidth: 12133.22 MB/sec
48
STREAM copy latency: 1.49 nanoseconds
STREAM copy bandwidth: 10736.56 MB/sec
STREAM scale latency: 1.50 nanoseconds
STREAM scale bandwidth: 10692.93 MB/sec
STREAM add latency: 1.34 nanoseconds
STREAM add bandwidth: 17902.85 MB/sec
STREAM triad latency: 1.96 nanoseconds
STREAM triad bandwidth: 12239.44 MB/sec

Intel(R) Xeon(R) CPU E5-2682 v4

#time for i in $(seq 0 8 51); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 1.59 nanoseconds
STREAM copy bandwidth: 10092.31 MB/sec
STREAM scale latency: 1.57 nanoseconds
STREAM scale bandwidth: 10169.16 MB/sec
STREAM add latency: 1.31 nanoseconds
STREAM add bandwidth: 18360.83 MB/sec
STREAM triad latency: 2.28 nanoseconds
STREAM triad bandwidth: 10503.81 MB/sec
8
STREAM copy latency: 1.55 nanoseconds
STREAM copy bandwidth: 10312.14 MB/sec
STREAM scale latency: 1.56 nanoseconds
STREAM scale bandwidth: 10283.70 MB/sec
STREAM add latency: 1.30 nanoseconds
STREAM add bandwidth: 18416.26 MB/sec
STREAM triad latency: 2.23 nanoseconds
STREAM triad bandwidth: 10777.08 MB/sec
16
STREAM copy latency: 2.02 nanoseconds
STREAM copy bandwidth: 7914.25 MB/sec
STREAM scale latency: 2.02 nanoseconds
STREAM scale bandwidth: 7919.85 MB/sec
STREAM add latency: 1.39 nanoseconds
STREAM add bandwidth: 17276.06 MB/sec
STREAM triad latency: 2.92 nanoseconds
STREAM triad bandwidth: 8231.18 MB/sec
24
STREAM copy latency: 1.99 nanoseconds
STREAM copy bandwidth: 8032.18 MB/sec
STREAM scale latency: 1.98 nanoseconds
STREAM scale bandwidth: 8061.12 MB/sec
STREAM add latency: 1.39 nanoseconds
STREAM add bandwidth: 17313.94 MB/sec
STREAM triad latency: 2.88 nanoseconds
STREAM triad bandwidth: 8318.93 MB/sec
 
#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Stepping: 1
CPU MHz: 2500.000
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 5000.06
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 40960K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63

AMD EPYC 7T83

#time for i in $(seq 0 8 255); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
0
STREAM copy latency: 0.49 nanoseconds
STREAM copy bandwidth: 32561.30 MB/sec
STREAM scale latency: 0.49 nanoseconds
STREAM scale bandwidth: 32620.66 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27575.20 MB/sec
STREAM triad latency: 0.70 nanoseconds
STREAM triad bandwidth: 34397.15 MB/sec
8
STREAM copy latency: 0.52 nanoseconds
STREAM copy bandwidth: 30764.47 MB/sec
STREAM scale latency: 0.53 nanoseconds
STREAM scale bandwidth: 30056.59 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27575.20 MB/sec
STREAM triad latency: 0.69 nanoseconds
STREAM triad bandwidth: 34789.45 MB/sec
16
STREAM copy latency: 0.53 nanoseconds
STREAM copy bandwidth: 30173.15 MB/sec
STREAM scale latency: 0.54 nanoseconds
STREAM scale bandwidth: 29895.91 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27496.11 MB/sec
STREAM triad latency: 0.70 nanoseconds
STREAM triad bandwidth: 34128.93 MB/sec
24
STREAM copy latency: 0.78 nanoseconds
STREAM copy bandwidth: 20417.69 MB/sec
STREAM scale latency: 0.51 nanoseconds
STREAM scale bandwidth: 31354.70 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27548.79 MB/sec
STREAM triad latency: 0.69 nanoseconds
STREAM triad bandwidth: 34589.22 MB/sec
32
STREAM copy latency: 0.60 nanoseconds
STREAM copy bandwidth: 26862.34 MB/sec
STREAM scale latency: 0.58 nanoseconds
STREAM scale bandwidth: 27376.00 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27518.66 MB/sec
STREAM triad latency: 0.78 nanoseconds
STREAM triad bandwidth: 30779.17 MB/sec
40
STREAM copy latency: 0.59 nanoseconds
STREAM copy bandwidth: 27230.21 MB/sec
STREAM scale latency: 0.59 nanoseconds
STREAM scale bandwidth: 27284.18 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27503.63 MB/sec
STREAM triad latency: 0.77 nanoseconds
STREAM triad bandwidth: 31242.48 MB/sec
48
STREAM copy latency: 0.59 nanoseconds
STREAM copy bandwidth: 27102.37 MB/sec
STREAM scale latency: 0.59 nanoseconds
STREAM scale bandwidth: 27164.08 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27503.63 MB/sec
STREAM triad latency: 0.76 nanoseconds
STREAM triad bandwidth: 31422.90 MB/sec
56
STREAM copy latency: 0.92 nanoseconds
STREAM copy bandwidth: 17453.54 MB/sec
STREAM scale latency: 0.59 nanoseconds
STREAM scale bandwidth: 27267.55 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27488.61 MB/sec
STREAM triad latency: 0.77 nanoseconds
STREAM triad bandwidth: 31169.92 MB/sec
64
STREAM copy latency: 0.88 nanoseconds
STREAM copy bandwidth: 18231.15 MB/sec
STREAM scale latency: 0.84 nanoseconds
STREAM scale bandwidth: 18976.06 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26413.87 MB/sec
STREAM triad latency: 1.08 nanoseconds
STREAM triad bandwidth: 22310.12 MB/sec
72
STREAM copy latency: 0.86 nanoseconds
STREAM copy bandwidth: 18552.45 MB/sec
STREAM scale latency: 0.84 nanoseconds
STREAM scale bandwidth: 19113.88 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26375.81 MB/sec
STREAM triad latency: 1.08 nanoseconds
STREAM triad bandwidth: 22151.79 MB/sec
80
STREAM copy latency: 0.89 nanoseconds
STREAM copy bandwidth: 18037.59 MB/sec
STREAM scale latency: 0.87 nanoseconds
STREAM scale bandwidth: 18398.59 MB/sec
STREAM add latency: 0.92 nanoseconds
STREAM add bandwidth: 26142.91 MB/sec
STREAM triad latency: 1.08 nanoseconds
STREAM triad bandwidth: 22133.53 MB/sec
88
STREAM copy latency: 0.93 nanoseconds
STREAM copy bandwidth: 17119.60 MB/sec
STREAM scale latency: 0.94 nanoseconds
STREAM scale bandwidth: 17030.54 MB/sec
STREAM add latency: 0.92 nanoseconds
STREAM add bandwidth: 26146.30 MB/sec
STREAM triad latency: 1.08 nanoseconds
STREAM triad bandwidth: 22159.10 MB/sec
96
STREAM copy latency: 1.39 nanoseconds
STREAM copy bandwidth: 11512.93 MB/sec
STREAM scale latency: 0.87 nanoseconds
STREAM scale bandwidth: 18406.16 MB/sec
STREAM add latency: 0.92 nanoseconds
STREAM add bandwidth: 25991.03 MB/sec
STREAM triad latency: 1.09 nanoseconds
STREAM triad bandwidth: 22078.91 MB/sec
104
STREAM copy latency: 0.86 nanoseconds
STREAM copy bandwidth: 18546.04 MB/sec
STREAM scale latency: 1.39 nanoseconds
STREAM scale bandwidth: 11518.85 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26300.01 MB/sec
STREAM triad latency: 1.06 nanoseconds
STREAM triad bandwidth: 22599.38 MB/sec
112
STREAM copy latency: 0.88 nanoseconds
STREAM copy bandwidth: 18253.46 MB/sec
STREAM scale latency: 0.85 nanoseconds
STREAM scale bandwidth: 18758.59 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26413.87 MB/sec
STREAM triad latency: 1.06 nanoseconds
STREAM triad bandwidth: 22648.95 MB/sec
120
STREAM copy latency: 0.86 nanoseconds
STREAM copy bandwidth: 18607.75 MB/sec
STREAM scale latency: 0.84 nanoseconds
STREAM scale bandwidth: 18957.30 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26427.74 MB/sec
STREAM triad latency: 1.08 nanoseconds
STREAM triad bandwidth: 22313.83 MB/sec
128
STREAM copy latency: 0.82 nanoseconds
STREAM copy bandwidth: 19432.13 MB/sec
STREAM scale latency: 0.87 nanoseconds
STREAM scale bandwidth: 18421.31 MB/sec
STREAM add latency: 0.98 nanoseconds
STREAM add bandwidth: 24546.03 MB/sec
STREAM triad latency: 1.06 nanoseconds
STREAM triad bandwidth: 22702.59 MB/sec
136
STREAM copy latency: 0.74 nanoseconds
STREAM copy bandwidth: 21568.01 MB/sec
STREAM scale latency: 0.74 nanoseconds
STREAM scale bandwidth: 21668.99 MB/sec
STREAM add latency: 0.90 nanoseconds
STREAM add bandwidth: 26697.59 MB/sec
STREAM triad latency: 0.91 nanoseconds
STREAM triad bandwidth: 26320.64 MB/sec
144
STREAM copy latency: 0.79 nanoseconds
STREAM copy bandwidth: 20268.45 MB/sec
STREAM scale latency: 0.66 nanoseconds
STREAM scale bandwidth: 24279.61 MB/sec
STREAM add latency: 0.89 nanoseconds
STREAM add bandwidth: 26822.08 MB/sec
STREAM triad latency: 0.84 nanoseconds
STREAM triad bandwidth: 28540.76 MB/sec
152
STREAM copy latency: 0.85 nanoseconds
STREAM copy bandwidth: 18903.90 MB/sec
STREAM scale latency: 0.56 nanoseconds
STREAM scale bandwidth: 28734.25 MB/sec
STREAM add latency: 0.88 nanoseconds
STREAM add bandwidth: 27335.58 MB/sec
STREAM triad latency: 0.75 nanoseconds
STREAM triad bandwidth: 31911.01 MB/sec
160
STREAM copy latency: 0.64 nanoseconds
STREAM copy bandwidth: 25068.68 MB/sec
STREAM scale latency: 0.63 nanoseconds
STREAM scale bandwidth: 25550.68 MB/sec
STREAM add latency: 0.88 nanoseconds
STREAM add bandwidth: 27313.33 MB/sec
STREAM triad latency: 0.82 nanoseconds
STREAM triad bandwidth: 29416.50 MB/sec
168
STREAM copy latency: 0.61 nanoseconds
STREAM copy bandwidth: 26232.33 MB/sec
STREAM scale latency: 0.60 nanoseconds
STREAM scale bandwidth: 26717.96 MB/sec
STREAM add latency: 0.88 nanoseconds
STREAM add bandwidth: 27398.82 MB/sec
STREAM triad latency: 0.79 nanoseconds
STREAM triad bandwidth: 30411.86 MB/sec
176
STREAM copy latency: 0.58 nanoseconds
STREAM copy bandwidth: 27380.19 MB/sec
STREAM scale latency: 0.58 nanoseconds
STREAM scale bandwidth: 27740.96 MB/sec
STREAM add latency: 0.94 nanoseconds
STREAM add bandwidth: 25666.31 MB/sec
STREAM triad latency: 0.77 nanoseconds
STREAM triad bandwidth: 31150.63 MB/sec
184
STREAM copy latency: 0.90 nanoseconds
STREAM copy bandwidth: 17730.21 MB/sec
STREAM scale latency: 0.57 nanoseconds
STREAM scale bandwidth: 27918.40 MB/sec
STREAM add latency: 0.87 nanoseconds
STREAM add bandwidth: 27458.61 MB/sec
STREAM triad latency: 0.76 nanoseconds
STREAM triad bandwidth: 31457.27 MB/sec
192
STREAM copy latency: 0.91 nanoseconds
STREAM copy bandwidth: 17558.57 MB/sec
STREAM scale latency: 0.88 nanoseconds
STREAM scale bandwidth: 18115.49 MB/sec
STREAM add latency: 0.92 nanoseconds
STREAM add bandwidth: 26031.36 MB/sec
STREAM triad latency: 1.12 nanoseconds
STREAM triad bandwidth: 21443.95 MB/sec
200
STREAM copy latency: 1.34 nanoseconds
STREAM copy bandwidth: 11911.40 MB/sec
STREAM scale latency: 0.85 nanoseconds
STREAM scale bandwidth: 18893.26 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26306.88 MB/sec
STREAM triad latency: 1.09 nanoseconds
STREAM triad bandwidth: 22013.73 MB/sec
208
STREAM copy latency: 1.36 nanoseconds
STREAM copy bandwidth: 11724.12 MB/sec
STREAM scale latency: 0.86 nanoseconds
STREAM scale bandwidth: 18631.00 MB/sec
STREAM add latency: 0.92 nanoseconds
STREAM add bandwidth: 26166.69 MB/sec
STREAM triad latency: 1.10 nanoseconds
STREAM triad bandwidth: 21763.86 MB/sec
216
STREAM copy latency: 0.88 nanoseconds
STREAM copy bandwidth: 18270.85 MB/sec
STREAM scale latency: 0.85 nanoseconds
STREAM scale bandwidth: 18848.15 MB/sec
STREAM add latency: 0.92 nanoseconds
STREAM add bandwidth: 26176.90 MB/sec
STREAM triad latency: 1.10 nanoseconds
STREAM triad bandwidth: 21799.20 MB/sec
224
STREAM copy latency: 0.89 nanoseconds
STREAM copy bandwidth: 18047.29 MB/sec
STREAM scale latency: 0.86 nanoseconds
STREAM scale bandwidth: 18677.66 MB/sec
STREAM add latency: 0.92 nanoseconds
STREAM add bandwidth: 26112.39 MB/sec
STREAM triad latency: 1.09 nanoseconds
STREAM triad bandwidth: 21966.89 MB/sec
232
STREAM copy latency: 1.35 nanoseconds
STREAM copy bandwidth: 11818.58 MB/sec
STREAM scale latency: 0.82 nanoseconds
STREAM scale bandwidth: 19568.11 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26469.44 MB/sec
STREAM triad latency: 1.06 nanoseconds
STREAM triad bandwidth: 22702.59 MB/sec
240
STREAM copy latency: 0.87 nanoseconds
STREAM copy bandwidth: 18325.74 MB/sec
STREAM scale latency: 0.83 nanoseconds
STREAM scale bandwidth: 19331.37 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26455.52 MB/sec
STREAM triad latency: 1.06 nanoseconds
STREAM triad bandwidth: 22580.37 MB/sec
248
STREAM copy latency: 0.87 nanoseconds
STREAM copy bandwidth: 18418.79 MB/sec
STREAM scale latency: 0.84 nanoseconds
STREAM scale bandwidth: 19019.09 MB/sec
STREAM add latency: 0.91 nanoseconds
STREAM add bandwidth: 26483.37 MB/sec
STREAM triad latency: 1.08 nanoseconds
STREAM triad bandwidth: 22148.13 MB/sec

stream對比數據

總結下幾個CPU用stream測試訪問內存的RT以及抖動和帶寬對比數據，重點關注帶寬，這個測試中時延不重要

	最小RT	最大RT	最大copy bandwidth	最小copy bandwidth
申威3231(2numa node)	7.09	8.75	2256.59 MB/sec	1827.88 MB/sec
飛騰2500(16 numa node)	2.84	10.34	5638.21 MB/sec	1546.68 MB/sec
鯤鵬920(4 numa node)	1.84	3.87	8700.75 MB/sec	4131.81 MB/sec
海光7280(8 numa node)	1.38	2.58	11591.48 MB/sec	6206.99 MB/sec
海光5280(4 numa node)	1.22	2.52	13166.34 MB/sec	6357.71 MB/sec
Intel8269CY(2 numa node)	1.12	1.52	14293.68 MB/sec	10551.71 MB/sec
Intel E5-2682(2 numa node)	1.58	2.02	10092.31 MB/sec	7914.25 MB/sec
AMD EPYC 7T83(4 numa node)	0.49	1.39	32561.30 MB/sec	11512.93 MB/sec
Y	1.83	3.48	8764.72 MB/sec	4593.25 MB/sec

從以上數據可以看出這5款CPU性能一款比一款好，飛騰2500慢的core上延時快到intel 8269的10倍了，平均延時5倍以上了。延時數據基本和單核上測試sysbench TPS一致。性能差不多就是：常數*主頻/RT

lat_mem_rd對比數據

用不同的node上的core 跑lat_mem_rd測試訪問node0內存的RT，只取最大64M的時延，時延和node距離完全一致

	RT變化
飛騰2500(16 numa node)	core:0 149.976 core:8 168.805 core:16 191.415 core:24 178.283 core:32 170.814 core:40 185.699 core:48 212.281 core:56 202.479 core:64 426.176 core:72 444.367 core:80 465.894 core:88 452.245 core:96 448.352 core:104 460.603 core:112 485.989 core:120 490.402
鯤鵬920(4 numa node)	core:0 117.323 core:24 135.337 core:48 197.782 core:72 219.416
海光7280(8 numa node)	numa0 106.839 numa1 168.583 numa2 163.925 numa3 163.690 numa4 289.628 numa5 288.632 numa6 236.615 numa7 291.880 分割行 enabled die interleaving core:0 153.005 core:16 152.458 core:32 272.057 core:48 269.441
海光5280(4 numa node)	core:0 102.574 core:8 160.989 core:16 286.850 core:24 231.197
海光7260(1 numa node)	core:0 265
Intel 8269CY(2 numa node)	core:0 69.792 core:26 93.107
申威3231(2numa node)	core:0 215.146 core:32 282.443
AMD EPYC 7T83(4 numa node)	core:0 71.656 core:32 80.129 core:64 131.334 core:96 129.563
Y7（2Die，2node，1socket）	core:8 42.395 core:40 36.434 core:104 105.745 core:88 124.384 core:24 62.979 core:8 69.324 core:64 137.233 core:88 127.250 133ns 205ns （待測）

測試命令：

1	for i in $(seq 0 8 127); do echo core:$i; numactl -C $i -m 0 ./bin/lat_mem_rd -W 5 -N 5 -t 64M; done >lat.log 2>&1

測試結果和numactl -H 看到的node distance完全一致，芯片廠家應該就是這樣測試然後把這個延遲當做距離寫進去了

AMD EPYC 7T83(4 numa node)的時延相對抖動有點大，這和架構多個小Die合併成一塊CPU有關

#grep -E "core|64.00000" lat.log
core:0
64.00000 71.656
core:32
64.00000 80.129
core:64
64.00000 131.334
core:88
64.00000 136.774
core:96
64.00000 129.563
core:120
64.00000 140.151

AMD EPYC 7T83(4 numa node)比Intel 8269時延要大，但是帶寬也高很多

龍芯測試數據

3A5000爲龍芯，執行的命令爲./lat_mem_rd 128M 4096，其中 4096 參數爲跳步大小。其基本原理是，通過按給定間隔去循環讀一定大小的內存區域，測量每個讀平均的時間。如果區域大小小於 L1 Cache 大小，時間應該接近 L1 的訪問延遲;如果大於 L1 小於 L2，則接近 L2 訪問延遲;依此類推。圖中橫坐標爲訪問的字節數，縱座標爲訪存的拍數(cycles)。

基於跳步訪問的 3A5000 和 Zen1、Skylake 各級延遲的比較(cycles)

下圖給出了 LMbench 測試得到的訪存操作的併發性，執行的命令爲./par_mem。訪存操作的並發性是各級 Cache 和內存所支持併發訪問的能力。在 LMbench 中，訪存操作併發性的測試是設計一個鏈表，不斷地遍歷訪問下一個鏈表中的元素，鏈表所跳的距離和需要測量的 Cache 容量相關，在一段時間能併發的發起對鏈表的追逐操作，也就是同時很多鏈表在遍歷，如果發現這一段時間內能同時完成 N 個鏈表的追逐操作，就認爲訪存的併發操作是 N。

下圖列出了三款處理器的功能部件操作延遲數據，使用的命令是./lat_ops。

龍芯stream數據

LMbench 包含了 STREAM 帶寬測試工具，可以用來測試可持續的內存訪問帶寬情況。圖表12.25列出了三款處理器的 STREAM 帶寬數據，其中 STREAM 數組大小設置爲 1 億個元素，採用 OpenMP 版本同時運行四個線程來測試滿載帶寬;相應測試平臺均爲 CPU 的兩個內存控制器各接一根內存條， 3A5000 和 Zen1 用 DDR4 3200 內存條，Skylake 用 DDR4 2400 內存條(它最高只支持這個規格)。

從數據可以看到，雖然硬件上 3A5000 和 Zen1 都實現了 DDR4 3200，但 3A5000 的實測可持續帶寬還是有一定差距。用戶程序看到的內存帶寬不僅僅和內存的物理頻率有關係，也和處理器內部的各種訪存隊列、內存控制器的調度策略、預取器和內存時序參數設置等相關，需要進行更多分析來定位具體的瓶頸點。像 STREAM 這樣的軟件測試工具，能夠更好地反映某個子系統的綜合能力，因而被廣泛採用。

對比結論

AMD單核跑分數據比較好
MySQL 查詢場景下Intel的性能好很多
xdb比社區版性能要好
MySQL8.0比5.7在多核鎖競爭場景下性能要好
intel最好，AMD接近Intel，海光差的比較遠但是又比鯤鵬好很多，飛騰最差，尤其是跨socket簡直是災難
麒麟OS性能也比CentOS略差一些
從perf指標來看鯤鵬920的L1d命中率高於8163是因爲鯤鵬L1 size大；L2命中率低於8163，同樣是因爲鯤鵬 L2 size小；同樣L1i 鯤鵬也大於8163，但是實際跑起來L1i Miss Rate更高，這說明 ARM對 L1d 使用效率低

整體來說AMD用領先了一代的工藝（7nm VS 14nm)，在MySQL查詢場景中終於可以接近Intel了，但是海光、鯤鵬、飛騰還是不給力。

附表

鯤鵬920 和 8163 在 MySQL 場景下的 perf 指標對比

整體對比
指標	X86	ARM	增加幅度
IPC	0.4979	0.495	-0.6%
Branchs	237606414772	415979894985	75.1%
Branch-misses	8104247620	28983836845	257.6%
Branch-missed rate	0.034	0.070	104.3%
內存讀帶寬（GB/S)	25.0	25.0	-0.2%
內存寫帶寬（GB/S)	24.6	67.8	175.5%
內存讀寫帶寬（GB/S)	49.7	92.8	86.8%
UNALIGNED_ACCESS	1329146645	13686011901	929.7%
L1d_MISS_RATIO	0.06055	0.04281	-29.3%
L1d_MISS_RATE	0.01645	0.01711	4.0%
L2_MISS_RATIO	0.34824	0.47162	35.4%
L2_MISS_RATE	0.00577	0.03493	504.8%
L1_ITLB_MISS_RATE	0.0028	0.005	78.6%
L1_DTLB_MISS_RATE	0.0025	0.0102	308.0%
context-switchs	8407195	11614981	38.2%
Pagefault	228371	741189	224.6%

參考資料

CPU的製造和概念

CPU 性能和Cache Line

Perf IPC以及CPU性能

Intel、海光、鯤鵬920、飛騰2500 CPU性能對比

飛騰ARM芯片(FT2500)的性能測試的性能測試/)

十年後數據庫還是不敢擁抱NUMA？

一次海光物理機資源競爭壓測的記錄

Intel PAUSE指令變化是如何影響自旋鎖以及MySQL的性能的

lmbench測試要考慮cache等

comment：

Intel 8163 IPC是0.67，和在PostgreSQL下測得數據基本一致。Oracle可以達到更高的IPC。從8163的perf結果中，看不出來訪存在總週期中的佔比。可以添加幾個cycle_activity.cycles_l1d_miss、cycle_activity.stalls_mem_any，看看訪存耗用的週期佔比。

作者|plantegg