Optimizing Memory Copies

http://www.blogcn.com/blog/cool/main.asp?uid=flier_lu&id=1577430
http://www.blogcn.com/blog/cool/main.asp?uid=flier_lu&id=1577440

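In complex low-level networking code, memory copies, string comparisons, and searches easily become performance bottlenecks. The versions of these functions that ship with a compiler are optimized in a general-purpose way, but because compatibility limits which instruction sets they can use, they fall well short of what the hardware can deliver. Optimizing for a specific hardware platform can speed these operations up considerably. Below I take memory copy on the P4 platform as the example and, following the examples in an optimization document published by AMD, briefly show how platform-specific instructions can make better use of memory bandwidth. Because of hardware limitations I did not reach the 300% memcpy speedup the AMD document claims, but on my machine I still measured a clear 175%-200% improvement (the figure will vary from machine to machine).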

Optimizing Memory Bandwidth from AMD

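According to the popular reading of Moore's Law, CPU speed doubles every 18 months, while memory and external storage (hard disks) have not kept pace. The resulting gap between fast CPUs and comparatively slow memory and peripherals is the bottleneck in many programs. Getting the most out of the existing hardware is therefore the main route to optimization below the algorithmic level. For memory copies, the key is to understand the cache and use it sensibly. We trade compatibility for performance here, so the discussion and code below target P4-class and later CPUs; AMD chips differ in implementation but share the instruction set and overall structure.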

Let's start with the simplest possible assembly implementation of memcpy:

Listing:

;
; Flier Lu ([email protected])
;
; nasmw.exe -f win32 fastmemcpy.asm -o fastmemcpy.obj
;
; extern "C" {
; // Note: the stack offsets defined below read src first, then dst.
; extern void fast_memcpy1(const void *src, void *dst, size_t size);
; }
;
cpu p4

segment .text use32

global _fast_memcpy1

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy1:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len]

rep movsb

pop edi
pop esi
ret

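For portability I use NASM syntax here. NASM is an excellent open-source assembler that supports many platforms and object formats and is widely used by open-source projects. It spares me from maintaining both VC inline assembly and the awkward Unix-style AT&T syntax that GCC uses :P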

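The cpu p4 directive at the top enables the P4 instruction set, since much of the later optimization relies on P4 instructions and features. Next, segment .text use32 places the code in a 32-bit code segment. The global directive then exports the label _fast_memcpy1 as a global symbol, so that C++ code can reach it after linking against the .obj file. Finally, the %define lines define macros for accessing the function's arguments; the esp+8+4 offset skips the two registers pushed on entry (8 bytes) and the return address (4 bytes).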

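On the C++ side, all that is needed is a declaration of fast_memcpy1 and linking against the .obj file that NASM produces. When assembling, the -f option selects the output object format (here Microsoft 32-bit COFF), and -o names the output file.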

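The code above is very simple and suits fast copies of small blocks. In fact, for small copies the VC compiler will, depending on the situation, replace the memcpy call with an inline rep movsb, which saves the function call and stack handling and improves both code size and speed.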

On 32-bit x86, however, there is no reason to copy one byte at a time; replacing movsb with movsd is the obvious choice.

Listing:

global _fast_memcpy2

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy2:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len]
shr ecx, 2 ; convert to DWORD count

rep movsd

pop edi
pop esi
ret

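To keep the examples simple, assume from here on that the length of the source and destination blocks is a multiple of 64 bytes and that both blocks are aligned on 4K page boundaries. Combined with the alignment, the first assumption means whole cache lines are copied and no single instruction straddles a cache line; the second ensures that page-crossing effects do not distort the timings. The reasons for these assumptions are explained in more detail in the cache discussion later on.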
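The two assumptions above can be arranged from the C++ side before calling any of the copy routines. The following is only a minimal sketch: `AlignedBuffer`, `makeAlignedBuffer`, `kCacheLine`, and `kPageSize` are hypothetical names introduced here, and the 64-byte line and 4K page sizes are the values assumed by this article rather than values queried from the machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr std::size_t kCacheLine = 64;   // P4 cache line size assumed in the text
constexpr std::size_t kPageSize = 4096;  // 4K page size assumed in the text

// Hypothetical helper: owns raw storage and exposes a page-aligned view of it.
struct AlignedBuffer {
    AlignedBuffer() = default;
    AlignedBuffer(AlignedBuffer&&) = default;
    AlignedBuffer(const AlignedBuffer&) = delete; // a copy would leave data dangling

    std::vector<unsigned char> storage; // over-allocated backing store
    unsigned char* data = nullptr;      // 4K-aligned start inside storage
    std::size_t size = 0;               // usable bytes, a multiple of 64
};

// Rounds the request up to a multiple of 64 bytes and aligns the start to 4K.
inline AlignedBuffer makeAlignedBuffer(std::size_t requested) {
    AlignedBuffer buf;
    buf.size = (requested + kCacheLine - 1) / kCacheLine * kCacheLine;
    buf.storage.resize(buf.size + kPageSize); // slack for the alignment shift
    void* p = buf.storage.data();
    std::size_t space = buf.storage.size();
    void* aligned = std::align(kPageSize, buf.size, p, space);
    assert(aligned != nullptr);
    buf.data = static_cast<unsigned char*>(aligned);
    return buf;
}
```

Over-allocating and aligning inside a std::vector keeps the sketch portable; a platform allocator that returns page-aligned memory directly would serve the same purpose.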

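Modern CPUs mostly have very long instruction pipelines, though, and several simple instructions executing in parallel are often more efficient than one complex instruction. The AMD document therefore suggests this optimization:

Listing: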

global _fast_memcpy3

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy3:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len]
shr ecx, 2 ; convert to DWORD count

.copyloop:
mov eax, dword [esi]
mov dword [edi], eax

add esi, 4
add edi, 4

dec ecx
jnz .copyloop

pop edi
pop esi
ret

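The loop at the .copyloop label does exactly the same work as rep movsd, but because it consists of several instructions, the CPU pipeline can in theory process them in parallel. The AMD document reports a 1.5% improvement from this; in my own measurements the effect was barely noticeable. By contrast, back when code was moving from the 486 to the Pentium architecture, the difference between the two forms was dramatic. As I recall, Delphi 3 (or was it 4) got a sizable boost in string handling from this one optimization alone. Today's mainstream CPUs, however, rely on microcode and emulate the CISC instruction set with RISC-like micro-operations in the core, so the gain is now small.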
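The listings shift the byte length right by two and silently drop any remainder, which is fine under the stated 64-byte assumption but not in general. As a rough C++ sketch of the same 4-bytes-at-a-time loop with the leftover bytes handled, consider the following; `copyDwordsWithTail` is a hypothetical name, and the fixed-size `std::memcpy` calls are simply the portable way to express one 32-bit load and store (compilers typically reduce each to a single move).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies 4 bytes per iteration, then finishes the remaining 0-3 bytes
// one at a time. The buffers must not overlap.
inline void copyDwordsWithTail(void* dst, const void* src, std::size_t len) {
    assert(dst != nullptr && src != nullptr);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    // Same conversion as "shr ecx, 2": number of whole DWORDs.
    for (std::size_t dwords = len >> 2; dwords != 0; --dwords) {
        std::uint32_t tmp;
        std::memcpy(&tmp, s, sizeof tmp); // one 32-bit load
        std::memcpy(d, &tmp, sizeof tmp); // one 32-bit store
        s += 4;
        d += 4;
    }
    // Leftover bytes that the DWORD loop cannot cover.
    for (std::size_t tail = len & 3; tail != 0; --tail) {
        *d++ = *s++;
    }
}
```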

Next, loop unrolling can raise the amount of data handled per iteration and cut the number of iterations, which should improve performance.

Listing:

global _fast_memcpy4

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy4:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len]
shr ecx, 4 ; convert to 16-byte size count

.copyloop:
mov eax, dword [esi]
mov dword [edi], eax

mov edx, dword [esi+4] ; edx rather than ebx: ebx is callee-saved and is not preserved here
mov dword [edi+4], edx

mov eax, dword [esi+8]
mov dword [edi+8], eax

mov edx, dword [esi+12]
mov dword [edi+12], edx

add esi, 16
add edi, 16

dec ecx
jnz .copyloop

pop edi
pop esi
ret

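According to the AMD document's measurements, though, this version is actually 1.5% slower, heh. Their explanation is that reads and writes need to be grouped so that the CPU can handle each group in one go. Regrouping the operations as follows makes it 3% faster than _fast_memcpy3 -_-b

Listing: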

global _fast_memcpy5

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy5:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len]
shr ecx, 4 ; convert to 16-byte size count

.copyloop:
mov eax, dword [esi]
mov edx, dword [esi+4] ; edx rather than ebx: ebx is callee-saved and is not preserved here
mov dword [edi], eax
mov dword [edi+4], edx

mov eax, dword [esi+8]
mov edx, dword [esi+12]
mov dword [edi+8], eax
mov dword [edi+12], edx

add esi, 16
add edi, 16

dec ecx
jnz .copyloop

pop edi
pop esi
ret

Unfortunately I could not measure any real difference on the P4, heh. Presumably the P4 and AMD pipelines differ subtly in design :D
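For readers who prefer to experiment from C++, the grouped, 4x-unrolled loop can be sketched roughly as below. `copyUnrolledGrouped` is a hypothetical name, and the sketch only mirrors the source-level ordering of loads and stores: an optimizing compiler is free to reorder or vectorize it, so it does not guarantee the instruction sequence of the assembly listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies 16 bytes per iteration. Within each half, both loads are issued
// before both stores, mirroring the grouping used in the assembly version.
inline void copyUnrolledGrouped(void* dst, const void* src, std::size_t len) {
    assert(len % 16 == 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    for (std::size_t n = len >> 4; n != 0; --n) {
        std::uint32_t a, b;

        std::memcpy(&a, s, 4);      // load group 1
        std::memcpy(&b, s + 4, 4);
        std::memcpy(d, &a, 4);      // store group 1
        std::memcpy(d + 4, &b, 4);

        std::memcpy(&a, s + 8, 4);  // load group 2
        std::memcpy(&b, s + 12, 4);
        std::memcpy(d + 8, &a, 4);  // store group 2
        std::memcpy(d + 12, &b, 4);

        s += 16;
        d += 16;
    }
}
```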

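Since we are unrolling the loop anyway, why not unroll it further? x86 has only a handful of general-purpose registers, but there is MMX now, with registers to spare :D With the MMX registers, each load or store moves 8 bytes (64 bits), and using all eight registers one loop iteration moves 64 bytes. That gives another 7% over _fast_memcpy5.

Listing: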

global _fast_memcpy6

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy6:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len] ; length in bytes, assumed to be a multiple of 64
shr ecx, 3 ; convert to QWORD (8-byte) count

lea esi, [esi+ecx*8] ; end of source
lea edi, [edi+ecx*8] ; end of destination
neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.copyloop:
movq mm0, qword [esi+ecx*8]
movq mm1, qword [esi+ecx*8+8]
movq mm2, qword [esi+ecx*8+16]
movq mm3, qword [esi+ecx*8+24]
movq mm4, qword [esi+ecx*8+32]
movq mm5, qword [esi+ecx*8+40]
movq mm6, qword [esi+ecx*8+48]
movq mm7, qword [esi+ecx*8+56]

movq qword [edi+ecx*8], mm0
movq qword [edi+ecx*8+8], mm1
movq qword [edi+ecx*8+16], mm2
movq qword [edi+ecx*8+24], mm3
movq qword [edi+ecx*8+32], mm4
movq qword [edi+ecx*8+40], mm5
movq qword [edi+ecx*8+48], mm6
movq qword [edi+ecx*8+56], mm7

add ecx, 8
jnz .copyloop

emms

pop edi
pop esi

ret
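A comparable wide-register loop can be sketched in C++ with compiler intrinsics. Note the substitution: this sketch uses 128-bit SSE2 registers instead of the 64-bit MMX registers of the listing, because MMX intrinsics are not available on every current toolchain; it therefore needs no emms. `copyWideRegisters` and `SKETCH_HAVE_SSE2` are hypothetical names, and both buffers are assumed to be 16-byte aligned. The negative offset counting up to zero is the same combined pointer-and-counter idiom as neg ecx in the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Copies len bytes (a multiple of 64) between 16-byte-aligned buffers,
// 64 bytes per iteration. Falls back to memcpy when SSE2 is unavailable.
inline void copyWideRegisters(void* dst, const void* src, std::size_t len) {
    assert(len % 64 == 0);
#if SKETCH_HAVE_SSE2
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    // Point at the ends of both buffers and index with a negative offset.
    const char* sEnd = static_cast<const char*>(src) + len;
    char* dEnd = static_cast<char*>(dst) + len;
    for (std::ptrdiff_t off = -static_cast<std::ptrdiff_t>(len); off != 0; off += 64) {
        // Four grouped 16-byte loads...
        __m128i r0 = _mm_load_si128(reinterpret_cast<const __m128i*>(sEnd + off));
        __m128i r1 = _mm_load_si128(reinterpret_cast<const __m128i*>(sEnd + off + 16));
        __m128i r2 = _mm_load_si128(reinterpret_cast<const __m128i*>(sEnd + off + 32));
        __m128i r3 = _mm_load_si128(reinterpret_cast<const __m128i*>(sEnd + off + 48));
        // ...followed by four grouped 16-byte stores.
        _mm_store_si128(reinterpret_cast<__m128i*>(dEnd + off), r0);
        _mm_store_si128(reinterpret_cast<__m128i*>(dEnd + off + 16), r1);
        _mm_store_si128(reinterpret_cast<__m128i*>(dEnd + off + 32), r2);
        _mm_store_si128(reinterpret_cast<__m128i*>(dEnd + off + 48), r3);
    }
#else
    std::memcpy(dst, src, len);
#endif
}
```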

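At this point the conventional optimization techniques are pretty much exhausted, and it is time for unconventional ones, heh.
Let's step back and look at the cache structure of the P4 architecture.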

The IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide

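Chapter 10 of Intel's System Programming Guide describes memory cache control on IA-32. Because of the huge gap between CPU speed and memory speed, CPU vendors put multiple levels of cache inside and alongside the CPU to speed up access to frequently used data. Typically there are three levels, L1, L2, and L3, between the CPU and memory (there are also several TLB caches, not covered here). Each level differs from the next by roughly an order of magnitude in speed, and capacities differ considerably too (which really comes down to $, heh). The L1 cache is further split into an instruction cache and a data cache, serving different purposes. On P4 and Xeon processors, the L1 instruction cache is replaced by the Trace Cache, built into the NetBurst microarchitecture. The L1 data cache and the L2 cache are packaged in the CPU and, depending on the CPU grade, are 8-16K and 256-512K respectively. The L3 cache exists only on Xeon processors, also packaged in the CPU, at around 512K-1M.
You can see the cache configuration of your own machine with a CPU information tool such as CPUInfo. My system, for example, reports: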
P4 1.7G, 12K L1 Code Cache (trace cache), 8K L1 Data Cache, 256K L2 Cache.

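A cache is implemented as a number of lines (slots), each holding a run of contiguous data from one memory address; the cache controller handles loading and hits on reads and writes. I won't go into the mechanics here; interested readers can consult the Intel manual. What matters is that each line was 32 bytes before the P4 and is 64 bytes from the P4 on. Operations on a cache line always cover the whole line: even reading a single byte loads the entire 64-byte line. Much of the optimization that follows rests on these facts.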
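To make the whole-line rule concrete, the small C++ sketch below counts how many cache lines a memory access touches. `cacheLinesTouched` is a hypothetical helper, and the default line size of 64 bytes is the P4 value quoted above, not something detected at run time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the number of cache lines covered by the byte range
// [addr, addr + len). Every touched line is loaded in full.
inline std::size_t cacheLinesTouched(std::uintptr_t addr, std::size_t len,
                                     std::size_t lineSize = 64) {
    assert(lineSize > 0);
    if (len == 0) {
        return 0;
    }
    std::uintptr_t first = addr / lineSize;            // line of the first byte
    std::uintptr_t last = (addr + len - 1) / lineSize; // line of the last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

A one-byte read at any address costs one full line, while a 2-byte read starting at offset 63 straddles two lines; an aligned 64-byte copy touches exactly one line, which is why the alignment assumption matters.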

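The P4 supports as many as six caching modes, which I won't go through one by one. What affects our optimization is how the cache behaves on memory writes. WT (write-through) mode writes data to memory and updates the cache at the same time, whereas WB (write-back) mode writes only to the cache and defers the slower memory write. The two modes differ substantially in performance for memory variables that are accessed very frequently (on the order of millions of times per second). Forcing WB mode by writing a driver module that programs the MTRRs, for example, once produced good results in a Linux network card driver. It does little for memory copies, though, because what we want is to bypass the cache altogether, whether for lookup, loading, or writing.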

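Fortunately the P4 provides the MOVNTQ instruction, which uses WC (write-combining) semantics to write to memory directly, bypassing the cache. Our writes are pure writes: the data written will not be used again for some time, so under either WT or WB the cache work is redundant. The optimized code follows:

Listing: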

global _fast_memcpy7

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy7:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len] ; length in bytes, assumed to be a multiple of 64
shr ecx, 3 ; convert to QWORD (8-byte) count

lea esi, [esi+ecx*8] ; end of source
lea edi, [edi+ecx*8] ; end of destination
neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.copyloop:
movq mm0, qword [esi+ecx*8]
movq mm1, qword [esi+ecx*8+8]
movq mm2, qword [esi+ecx*8+16]
movq mm3, qword [esi+ecx*8+24]
movq mm4, qword [esi+ecx*8+32]
movq mm5, qword [esi+ecx*8+40]
movq mm6, qword [esi+ecx*8+48]
movq mm7, qword [esi+ecx*8+56]

movntq qword [edi+ecx*8], mm0
movntq qword [edi+ecx*8+8], mm1
movntq qword [edi+ecx*8+16], mm2
movntq qword [edi+ecx*8+24], mm3
movntq qword [edi+ecx*8+32], mm4
movntq qword [edi+ecx*8+40], mm5
movntq qword [edi+ecx*8+48], mm6
movntq qword [edi+ecx*8+56], mm7

add ecx, 8
jnz .copyloop

sfence ; flush write buffer
emms

pop edi
pop esi

ret

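Every movq that writes to memory is replaced with movntq, and sfence is executed once the copy is done. Non-temporal stores are weakly ordered and may still be sitting in the write-combining buffers, so sfence is needed to make them globally visible before any later stores. This eliminates the cache-line loads that would otherwise accompany the writes, along with the cache bookkeeping itself, and so greatly reduces redundant memory traffic. AMD claims a 60% performance gain; I measured a clear gain of around 50%.
For movntq, sfence, and related instructions, see the Intel instruction set reference: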

The IA-32 Intel Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M
The IA-32 Intel Architecture Software Developer's Manual, Volume 2B: Instruction Set Reference, N-Z
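The same cache-bypassing store can be tried from C++ through intrinsics. This is a sketch under stated assumptions, not a drop-in equivalent of the listing: it uses the 128-bit SSE2 streaming store `_mm_stream_si128` in place of the 64-bit MMX movntq, the destination is assumed to be 16-byte aligned, and `copyNonTemporal` and `SKETCH_HAVE_SSE2` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Copies len bytes (a multiple of 64) using non-temporal stores, then
// fences so the stores are globally visible before the function returns.
inline void copyNonTemporal(void* dst, const void* src, std::size_t len) {
    assert(len % 64 == 0);
#if SKETCH_HAVE_SSE2
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    const __m128i* s = static_cast<const __m128i*>(src);
    __m128i* d = static_cast<__m128i*>(dst);
    for (std::size_t n = len / 64; n != 0; --n) {
        // Ordinary (cached) loads; unaligned loads keep the source unrestricted.
        __m128i r0 = _mm_loadu_si128(s + 0);
        __m128i r1 = _mm_loadu_si128(s + 1);
        __m128i r2 = _mm_loadu_si128(s + 2);
        __m128i r3 = _mm_loadu_si128(s + 3);
        // Streaming stores write around the cache.
        _mm_stream_si128(d + 0, r0);
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
        s += 4;
        d += 4;
    }
    _mm_sfence(); // same role as the sfence in the listing
#else
    std::memcpy(dst, src, len);
#endif
}
```

As with the assembly version, streaming stores tend to pay off only when the destination is not read again soon; for small buffers that stay in cache, ordinary stores are usually the better choice.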

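With writes optimized, reads can be tuned in the same way. Although the CPU prefetches automatically when reading data, explicitly asking it to prefetch while working through a contiguous region still improves performance noticeably.

Listing: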

global _fast_memcpy8

%define param esp+8+4
%define src param+0
%define dst param+4
%define len param+8

_fast_memcpy8:
push esi
push edi

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len] ; length in bytes, assumed to be a multiple of 64
shr ecx, 3 ; convert to QWORD (8-byte) count

lea esi, [esi+ecx*8] ; end of source
lea edi, [edi+ecx*8] ; end of destination
neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.writeloop:
prefetchnta [esi+ecx*8 + 512] ; fetch ahead by 512 bytes

movq mm0, qword [esi+ecx*8]
movq mm1, qword [esi+ecx*8+8]
movq mm2, qword [esi+ecx*8+16]
movq mm3, qword [esi+ecx*8+24]
movq mm4, qword [esi+ecx*8+32]
movq mm5, qword [esi+ecx*8+40]
movq mm6, qword [esi+ecx*8+48]
movq mm7, qword [esi+ecx*8+56]

movntq qword [edi+ecx*8], mm0
movntq qword [edi+ecx*8+8], mm1
movntq qword [edi+ecx*8+16], mm2
movntq qword [edi+ecx*8+24], mm3
movntq qword [edi+ecx*8+32], mm4
movntq qword [edi+ecx*8+40], mm5
movntq qword [edi+ecx*8+48], mm6
movntq qword [edi+ecx*8+56], mm7

add ecx, 8
jnz .writeloop

sfence ; flush write buffer
emms

pop edi
pop esi

ret

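Adding a single prefetchnta instruction hints to the CPU that, while it handles the current reads, it should also fetch the 64-byte cache line located 512 bytes ahead. This gives roughly another 10% improvement.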
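In C++ the same hint is available as the `_mm_prefetch` intrinsic with `_MM_HINT_NTA`. The sketch below is illustrative only: `copyWithPrefetch`, `kLineBytes`, `kPrefetchAhead`, and `SKETCH_HAVE_PREFETCH` are hypothetical names, the 512-byte distance is simply the value used in the listing, and a plain 64-byte `std::memcpy` stands in for the group of register moves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__SSE__) || defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_PREFETCH 1
#else
#define SKETCH_HAVE_PREFETCH 0
#endif

constexpr std::size_t kLineBytes = 64;       // P4 cache line size
constexpr std::size_t kPrefetchAhead = 512;  // distance used in the listing

// Copies len bytes (a multiple of 64) one cache line at a time, hinting
// the line 512 bytes ahead of the one currently being copied.
inline void copyWithPrefetch(void* dst, const void* src, std::size_t len) {
    assert(len % kLineBytes == 0);
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t off = 0; off < len; off += kLineBytes) {
#if SKETCH_HAVE_PREFETCH
        // Only hint addresses that lie inside the source buffer.
        if (off + kPrefetchAhead < len) {
            _mm_prefetch(s + off + kPrefetchAhead, _MM_HINT_NTA);
        }
#endif
        std::memcpy(d + off, s + off, kLineBytes); // stand-in for the movq group
    }
}
```

The best prefetch distance depends on the machine, so 512 bytes is a starting point for measurement rather than a recommended constant.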
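Finally, the memory about to be processed can be forced into the cache with explicit reads, because prefetchnta is only a hint and the CPU may ignore it. This is said to yield a further improvement of around 60%; I did not measure that much, but the gain was still clear.

Listing: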

global _fast_memcpy9

%define param esp+12+4
%define src param+0
%define dst param+4
%define len param+8

%define CACHEBLOCK 400h ; prefetch block size in QWORDs (400h * 8 = 8 KB)

_fast_memcpy9:
push esi
push edi
push ebx

mov esi, [src] ; source array
mov edi, [dst] ; destination array
mov ecx, [len] ; length in bytes, assumed to be a multiple of CACHEBLOCK*8
shr ecx, 3 ; convert to QWORD (8-byte) count

lea esi, [esi+ecx*8] ; end of source
lea edi, [edi+ecx*8] ; end of destination
neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.mainloop:
mov eax, CACHEBLOCK / 16 ; note: .prefetchloop is unrolled 2X
add ecx, CACHEBLOCK ; move up to end of block

.prefetchloop:
mov ebx, [esi+ecx*8-64] ; read one address in this cache line...
mov ebx, [esi+ecx*8-128] ; ... and one in the previous line
sub ecx, 16 ; 16 QWORDS = 2 64-byte cache lines
dec eax
jnz .prefetchloop

mov eax, CACHEBLOCK / 8

.writeloop:
prefetchnta [esi+ecx*8 + 512] ; fetch ahead by 512 bytes

movq mm0, qword [esi+ecx*8]
movq mm1, qword [esi+ecx*8+8]
movq mm2, qword [esi+ecx*8+16]
movq mm3, qword [esi+ecx*8+24]
movq mm4, qword [esi+ecx*8+32]
movq mm5, qword [esi+ecx*8+40]
movq mm6, qword [esi+ecx*8+48]
movq mm7, qword [esi+ecx*8+56]

movntq qword [edi+ecx*8], mm0
movntq qword [edi+ecx*8+8], mm1
movntq qword [edi+ecx*8+16], mm2
movntq qword [edi+ecx*8+24], mm3
movntq qword [edi+ecx*8+32], mm4
movntq qword [edi+ecx*8+40], mm5
movntq qword [edi+ecx*8+48], mm6
movntq qword [edi+ecx*8+56], mm7

add ecx, 8
dec eax
jnz .writeloop

or ecx, ecx ; assumes integer number of cacheblocks
jnz .mainloop

sfence ; flush write buffer
emms

pop ebx
pop edi
pop esi

ret
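The two-pass structure of this last routine may be easier to follow in C++. The sketch below is a simplified illustration, not a translation of the listing: `copyBlockPrefetch`, `kTouchStride`, and `kBlockBytes` are hypothetical names, the 8 KB block matches CACHEBLOCK (400h QWORDs), and `std::memcpy` stands in for the movq/movntq loop of the second pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kTouchStride = 64;       // one read per 64-byte cache line
constexpr std::size_t kBlockBytes = 8 * 1024;  // CACHEBLOCK: 400h QWORDs = 8 KB

// Copies len bytes (a multiple of 8 KB) block by block. Each block is
// first pulled into the cache with real reads, then copied.
inline void copyBlockPrefetch(void* dst, const void* src, std::size_t len) {
    assert(len % kBlockBytes == 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    for (std::size_t base = 0; base < len; base += kBlockBytes) {
        // Pass 1: touch one byte in every line, walking backward through
        // the block as .prefetchloop does. The volatile read keeps the
        // compiler from discarding a load whose value is never used.
        for (std::size_t off = kBlockBytes; off != 0; off -= kTouchStride) {
            const volatile unsigned char* p = s + base + off - kTouchStride;
            unsigned char touched = *p;
            (void)touched;
        }
        // Pass 2: copy the block, which should now be resident in cache.
        std::memcpy(d + base, s + base, kBlockBytes);
    }
}
```

Unlike a prefetch hint, an ordinary load cannot be dropped by the CPU, which is the point of the first pass.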

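That completes the optimization of a memory copy function from start to finish. By understanding and exploiting the cache, each version beat the one before it, and the end result is quite satisfying (a claimed 300% improvement, 175%-200% measured, which is still pretty good).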

Two points need attention when writing the test code:

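The first is timer precision: use a high-resolution hardware counter to avoid measurement error. I recommend the rdtsc instruction, converting cycles to time using the CPU clock frequency. The frequency can be computed dynamically with a high-resolution timer; I took the lazy route and read it straight from the registry :P
The code is as follows:

Listing: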

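#ifdef WIN32
#include <windows.h>
typedef unsigned __int64 uint64_t;
#else
#include <stdint.h>
#endif

// Hz per MHz; converts the registry's "~MHz" value to Hz.
static const uint64_t MEGA = 1000000;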

bool GetPentiumClockEstimateFromRegistry(uint64_t& frequency)
{
HKEY hKey;

frequency = 0;

LONG rc = ::RegOpenKeyEx(HKEY_LOCAL_MACHINE, "Hardware\\Description\\System\\CentralProcessor\\0", 0, KEY_READ, &hKey);

if(rc == ERROR_SUCCESS)
{
DWORD cbBuffer = sizeof (DWORD);
DWORD freq_mhz;

rc = ::RegQueryValueEx(hKey, "~MHz", NULL, NULL, (LPBYTE)(&freq_mhz), &cbBuffer);

if (rc == ERROR_SUCCESS)
frequency = freq_mhz * MEGA;

RegCloseKey (hKey);
}

return frequency > 0;
}

void getTimeStamp(uint64_t& timeStamp)
{
#ifdef WIN32
__asm
{
push edx
push ecx
mov ecx, timeStamp
//_emit 0Fh // RDTSC
//_emit 31h
rdtsc
mov [ecx], eax
mov [ecx+4], edx
pop ecx
pop edx
}
#else
// The "=A" constraint binds EDX:EAX, which is only correct on 32-bit x86.
__asm__ __volatile__ ("rdtsc" : "=A" (timeStamp));
#endif
}
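Once two timestamps and the clock frequency are in hand, turning them into elapsed time and copy bandwidth is simple arithmetic. The helpers below are a small sketch; `cyclesToSeconds` and `bandwidthMBps` are hypothetical names, and the conversion assumes the timestamp counter ticks at the reported frequency for the whole measurement.

```cpp
#include <cassert>
#include <cstdint>

// Converts a pair of rdtsc readings to seconds, given the clock in Hz.
inline double cyclesToSeconds(std::uint64_t startCycles, std::uint64_t endCycles,
                              std::uint64_t frequencyHz) {
    assert(frequencyHz > 0 && endCycles >= startCycles);
    return static_cast<double>(endCycles - startCycles) /
           static_cast<double>(frequencyHz);
}

// Converts bytes copied and elapsed seconds to megabytes per second.
inline double bandwidthMBps(std::uint64_t bytesCopied, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytesCopied) / (1024.0 * 1024.0) / seconds;
}
```

For example, on a 1.7 GHz machine a delta of 850,000,000 cycles corresponds to half a second, and copying 1 MB in that time corresponds to 2 MB/s.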

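The second is the size of the buffers used for the copy test. If the buffers are too small, the first copy pulls all of the data into the L2 cache, and the results come out an order of magnitude better than for ordinary memory operations. My L2 cache is 256K, for example; if I copy between two 128K buffers, the speed is about 10 times that of an ordinary memory copy no matter how many times I loop. It is therefore necessary to use a fairly large buffer size.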
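A test harness along these lines might be sketched as follows. Everything here is an assumption made for illustration: `pickTestBufferSize`, `CopyFn`, and `timeCopySeconds` are hypothetical names, the factor of 16 times the L2 size is an arbitrary rule of thumb rather than a figure from the article, and `std::chrono::steady_clock` is used as a portable stand-in for rdtsc.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Picks a per-buffer size several times larger than the L2 cache, rounded
// up to a multiple of 64 bytes. The default factor is an assumption.
inline std::size_t pickTestBufferSize(std::size_t l2Bytes, std::size_t factor = 16) {
    assert(l2Bytes > 0 && factor > 0);
    std::size_t size = l2Bytes * factor;
    return (size + 63) / 64 * 64;
}

// memcpy-style signature; wrap an assembly routine in a small adapter
// with this signature before passing it in.
using CopyFn = void (*)(void* dst, const void* src, std::size_t len);

// Runs the copy `repeats` times over freshly allocated buffers and
// returns the total elapsed wall-clock time in seconds.
inline double timeCopySeconds(CopyFn copy, std::size_t bufBytes, int repeats) {
    std::vector<unsigned char> src(bufBytes, 0x5A);
    std::vector<unsigned char> dst(bufBytes, 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        copy(dst.data(), src.data(), bufBytes);
    }
    auto t1 = std::chrono::steady_clock::now();
    // Sanity check that the routine really copied the data.
    assert(std::memcmp(dst.data(), src.data(), bufBytes) == 0);
    return std::chrono::duration<double>(t1 - t0).count();
}
```

With a 256K L2 cache this picks 4 MB buffers, large enough that neither buffer can sit in L2 between iterations.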
