The Low-Level Implementation of atomic

Atomic operations

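Atomic operations come up all the time in programming. They are cheaper than taking a mutex, yet they still guarantee that an operation on a variable is atomic. Common examples are fetch_add, load, and increment (operator++).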
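As a quick illustration of the operations just named, here is a minimal sketch using std::atomic<int>. The helper name atomic_ops_demo is made up for this example; only the std::atomic member functions are standard.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: exercises fetch_add, operator++ and load on one
// counter and returns the final value.
int atomic_ops_demo() {
    std::atomic<int> counter{0};

    int before_add = counter.fetch_add(5);  // returns the value before the add (0)
    int before_inc = counter++;             // post-increment also returns the old value (5)
    int now = counter.load();               // atomic read (6)

    assert(before_add == 0);
    assert(before_inc == 5);
    return now;
}
```

Both fetch_add and the post-increment hand back the value the counter held before the update.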

The most basic explanation of how atomic is implemented is this: atomic operations are a feature supported by the underlying hardware.

But what exactly does "supported by the underlying hardware" mean? Let's start with a simple example program:

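#include <atomic>

int main()
{
    // Initialize explicitly: before C++20 a default-constructed
    // std::atomic<int> holds an indeterminate value.
    std::atomic<int> a{0};
    a++;  // atomic post-increment; this is the call inspected below
    return 0;
}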

Then compile it and inspect the generated assembly:

g++ -S atomic.cc
cat atomic.s

_ZNSt13__atomic_baseIiEppEi:
.LFB362:
	pushq	%rbp
	.seh_pushreg	%rbp
	movq	%rsp, %rbp
	.seh_setframe	%rbp, 0
	subq	$16, %rsp
	.seh_stackalloc	16
	.seh_endprologue
	movq	%rcx, 16(%rbp)
	movl	%edx, 24(%rbp)
	movq	16(%rbp), %rax
	movq	%rax, -8(%rbp)
	movl	$1, -12(%rbp)
	movl	$5, -16(%rbp)
	movl	-12(%rbp), %edx
	movq	-8(%rbp), %rax
	lock xaddl	%edx, (%rax)
	movl	%edx, %eax
	nop
	addq	$16, %rsp
	popq	%rbp
	ret
	.seh_endproc
	.ident	"GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"

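Notice that the increment is performed by an xaddl instruction with a lock prefix in front of it. The CPU's support for this lock prefix is the "underlying hardware support" mentioned above.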

With this prefix in place, the load-add-store sequence is guaranteed to be indivisible.
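To make the three steps concrete, the sketch below writes out load, add, and store by hand and uses a compare-exchange retry loop to keep them indivisible. It is only an illustration of what the single lock xadd instruction provides in one go; the function name fetch_add_via_cas is invented for this example.

```cpp
#include <atomic>

// Hypothetical helper: same observable result as value.fetch_add(delta),
// but with the load / add / store steps written out explicitly.
int fetch_add_via_cas(std::atomic<int>& value, int delta) {
    int observed = value.load();  // step 1: load

    // Steps 2 and 3: compute observed + delta and store it, but only if the
    // value is still `observed`. On failure compare_exchange_weak refreshes
    // `observed` with the current value, so the loop retries with fresh data.
    while (!value.compare_exchange_weak(observed, observed + delta)) {
    }
    return observed;  // the value before the add
}
```

If another thread modifies the value between the load and the store, the compare-exchange fails and the steps are repeated, so no update is lost.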

How the lock prefix is implemented

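As is well known, a CPU does not operate on main memory directly. It first loads data into its L1 and L2 caches (two levels are typical, and there may be more), then reads from the cache to do the computation.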

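Modern machines are usually multi-core, and each core has its own private L1 cache. Keeping cached data coherent across cores is an important part of CPU architecture design, and MESI is a commonly used cache-coherence protocol.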

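When atomic++ runs in a single thread, there is naturally no risk of cores disagreeing about the data. But with multiple threads running on multiple cores, how does the CPU guarantee the lock semantics?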
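The multi-threaded case can be sketched as follows: several threads increment one shared std::atomic<int>, and the final count comes out exact. The helper name count_with_threads and its parameters are assumptions made for this example. On some toolchains the program must be built with -pthread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each thread performs `increments_per_thread` atomic
// increments on one shared counter; returns the final count.
int count_with_threads(int thread_count, int increments_per_thread) {
    assert(thread_count > 0 && increments_per_thread >= 0);

    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter++;  // atomic read-modify-write, no mutex needed
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load();
}
```

With a plain int in place of the atomic this would be a data race, and increments could be lost.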

I will use Intel x86 CPUs as the example here, starting with the official documentation:

User level locks involve utilizing the atomic instructions of processor to atomically update a memory space. 
The atomic instructions involve utilizing a lock prefix on the instruction and having the destination operand assigned to a memory address. 
The following instructions can run atomically with a lock prefix on current Intel processors: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. EnterCriticalSection utilizes atomic instructions to attempt to get a user-land lock before jumping into the kernel. On most instructions a lock prefix must be explicitly used except for the xchg instruction where the lock prefix is implied if the instruction involves a memory address.

In the days of Intel 486 processors, the lock prefix used to assert a lock on the bus along with a large hit in performance.
Starting with the Intel Pentium Pro architecture, the bus lock is transformed into a cache lock. 
A lock will still be asserted on the bus in the most modern architectures if the lock resides in uncacheable memory or if the lock extends beyond a cache line boundary splitting cache lines. 
Both of these scenarios are unlikely, so most lock prefixes will be transformed into a cache lock which is much less expensive.

The passage above describes two ways in which the lock prefix achieves atomicity:

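  • Bus lock: expensive in terms of performance; this is how the Intel 486 implemented it.
  • Cache lock: used on modern processors. When the cache cannot be locked (the lock resides in uncacheable memory, or the locked operand crosses a cache line boundary and is split across two cache lines), the processor still falls back to locking the bus.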
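The "crosses a cache line boundary" condition is simple arithmetic, sketched below. The 64-byte line size is an assumption (common on current x86 processors but not guaranteed), and the helper name crosses_cache_line is invented for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Check your target CPU before relying on it.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical helper: true if the byte range [address, address + size)
// touches more than one cache line.
constexpr bool crosses_cache_line(std::uintptr_t address, std::size_t size,
                                  std::size_t line = kAssumedCacheLine) {
    return size != 0 && (address / line) != ((address + size - 1) / line);
}

// A 4-byte operand at offset 60 fits in one line; at offset 62 it is split.
static_assert(!crosses_cache_line(60, 4), "bytes 60..63 share a line");
static_assert(crosses_cache_line(62, 4), "bytes 62..65 span two lines");
```

A naturally aligned std::atomic<int> (4 bytes, 4-byte aligned) can never straddle a 64-byte boundary, which is one reason the bus-lock fallback is rare in ordinary code.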

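At this point many readers may feel they have understood it, but that is not quite enough: how are the bus lock and the cross-core cache lock actually implemented?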

The LOCK prefix (F0H) forces an operation that ensures exclusive use of shared memory in a multiprocessor environment.
See “LOCK—Assert LOCK# Signal Prefix” in Chapter 3, “Instruction Set Reference, A-L,” for a description of this prefix

Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.
In most IA-32 and all Intel 64 processors, locking may occur without the LOCK# signal being asserted. See the “IA32 Architecture Compatibility” section below for more details.
The LOCK prefix can be prepended only to the following instructions and only to those forms of the instructions where the destination operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, CMPXCHG16B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. 
If the LOCK prefix is used with one of these instructions and the source operand is a memory operand, an undefined opcode exception (#UD) may be generated. An undefined opcode exception will also be generated if the LOCK prefix is used with any instruction not in the above list. 
The XCHG instruction always asserts the LOCK# signal regardless of the presence or absence of the LOCK prefix.
The LOCK prefix is typically used with the BTS instruction to perform a read-modify-write operation on a memory location in shared memory environment.
The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory locking is observed for arbitrarily misaligned fields.
This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

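According to Intel's documentation, the LOCK prefix forces an operation that ensures exclusive use of shared memory in a multiprocessor environment. The integrity of the lock is not affected by the alignment of the memory field: memory locking is observed even for arbitrarily misaligned fields.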
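Several instructions in that list have close counterparts among the std::atomic member functions. The sketch below exercises a few of them. Which instruction the compiler actually emits for each call (for example lock or, lock and, lock cmpxchg, or xchg) depends on the compiler, the flags, and whether the return value is used, so treat the comments as typical x86-64 behaviour rather than a guarantee. The helper name rmw_demo is invented for this example.

```cpp
#include <atomic>

// Hypothetical helper: runs a few read-modify-write operations on a flag
// word and returns the last value it held.
unsigned rmw_demo() {
    std::atomic<unsigned> flags{0x1u};

    flags.fetch_or(0x4u);   // set bit 2:      0x1 -> 0x5 (OR-style update)
    flags.fetch_and(0x4u);  // keep only bit 2: 0x5 -> 0x4 (AND-style update)
    flags.fetch_xor(0x3u);  // toggle bits 0-1: 0x4 -> 0x7 (XOR-style update)

    unsigned expected = 0x7u;
    // CMPXCHG-style update: store 0x8 only if flags still equals `expected`.
    if (!flags.compare_exchange_strong(expected, 0x8u)) {
        return expected;  // not reached here, since no other thread touches flags
    }

    // XCHG-style update: swap in 0 and return the previous value (0x8).
    return flags.exchange(0u);
}
```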

BUS LOCK

This is Intel's official description of the bus lock:

Intel processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link. 
While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. This metric measures the ratio of bus cycles, during which a LOCK# signal is asserted on the bus. 
The LOCK# signal is asserted when there is a locked memory access due to uncacheable memory, locked operation that spans two cache lines, and page-walk from an uncacheable page table.

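In short: the processor asserts LOCK# automatically during certain critical memory operations in order to lock the system bus or an equivalent link, and while the signal is asserted, requests from other processors or bus agents for control of the bus are blocked. The metric described here is the fraction of bus cycles during which LOCK# is asserted. Three situations cause the signal to be asserted: a locked access to uncacheable memory, a locked operation that spans two cache lines, and a page walk through an uncacheable page table.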

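Here, acquiring the lock consists of a message on the bus that says, in effect, "OK, everybody get off the bus for a while" (for our purposes, that means "stop performing memory operations"). The core that sent the message then has to wait for every other core to finish the memory operations it has in flight and acknowledge the lock. Only after all the other cores have acknowledged can the core attempting the locked operation proceed. Finally, once the lock is released, that core again has to send a message to everyone on the bus saying, "All clear, you can resume issuing requests on the bus now."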

CACHE LOCK

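The cache lock is considerably more complex than the bus lock. It involves cache synchronization between cores, along with concepts such as memory barriers, cache lines, and shared memory. These will be covered in a later update.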

Other notes

The LOCK prefix is not only used to implement atomic. Other user-level synchronization primitives rely on it as well, for example spinlocks built on LOCK XCHG.
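A minimal sketch of such a spinlock is shown below, built on std::atomic<bool>::exchange, which on x86 is typically compiled to an xchg instruction. The class name SpinLock and the driver count_with_spinlock are invented for this example. This is an illustration, not a production-quality lock. On some toolchains the program must be built with -pthread.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock built on an atomic exchange.
class SpinLock {
public:
    void lock() {
        // exchange() writes true and returns the previous value.
        // A previous value of false means this thread acquired the lock.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();  // let other threads run while we spin
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical driver: a plain int protected by the spinlock.
int count_with_spinlock(int thread_count, int increments_per_thread) {
    SpinLock lock;
    int counter = 0;  // not atomic; only touched while holding the lock
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&lock, &counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter;
}
```

Because SpinLock exposes lock() and unlock(), it can also be used with std::lock_guard<SpinLock> (from <mutex>) instead of the manual calls.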
