CAS Atomic Operations and Pthread Futex

The definition of CAS given by Wikipedia:

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it).

Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks for the location still being equal to the old value. If CAS indicates that the attempt has failed, it has to be repeated from the beginning: the location is re-read, a new value is re-computed and the CAS is tried again.

https://en.wikipedia.org/wiki/Compare-and-swap
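The read, compute, CAS, retry pattern described above can be sketched with the C11 <stdatomic.h> functions. This is only a minimal illustration: the function name atomic_store_max and the choice of "maximum" as the computed value are assumptions made for the example, not something taken from the quoted text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical helper: atomically replace *loc with max(*loc, candidate).
   Returns the value that was observed just before the successful swap. */
static int atomic_store_max(atomic_int *loc, int candidate)
{
    assert(loc != NULL);

    /* Step 1: read the location and remember the old value. */
    int old = atomic_load(loc);

    for (;;) {
        /* Step 2: compute the new value from the old one. */
        int desired = old > candidate ? old : candidate;

        /* Step 3: swap only if *loc still equals old. On failure the
           current value of *loc is written back into old, so the next
           iteration recomputes from up-to-date information. */
        if (atomic_compare_exchange_weak(loc, &old, desired))
            return old;
    }
}
```

The weak variant may fail spuriously, which is harmless here because the call already sits inside a retry loop.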

Many C compilers support using compare-and-swap either with the C11 <stdatomic.h> functions, or some non-standard C extension of that particular C compiler, or by calling a function written directly in assembly language using the compare-and-swap instruction.

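C11 provides functions for CAS atomic operations; alternatively, CAS can be implemented with inline assembly.

Taking the Nginx source code as an example, its C inline-assembly implementation is as follows: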

static ngx_inline ngx_atomic_uint_t
ngx_atomic_cmp_set(ngx_atomic_t *lock, ngx_atomic_uint_t old,
    ngx_atomic_uint_t set)
{
    u_char res;

    __asm__ volatile (

        NGX_SMP_LOCK
        " cmpxchgq %3, %1; "
        " sete %0; "

        : "=a" (res) : "m" (*lock), "a" (old), "r" (set) : "cc", "memory");

    return res;
}

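On a single-core CPU the cmpxchgq instruction is atomic by itself. NGX_SMP_LOCK is a macro for the CPU's lock instruction prefix, which keeps the operation atomic on multi-core systems.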

#if (NGX_SMP)                                 
#define NGX_SMP_LOCK  "lock;"                 
#else                                          
#define NGX_SMP_LOCK                           
#endif

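The lock prefix asserts the processor's LOCK# signal, which locks the memory bus and prevents other CPUs from accessing memory over the bus until the lock-prefixed instruction completes. (On modern processors, when the target is in cacheable memory, this is usually done by locking the cache line rather than the whole bus.)
cmpxchgq %3, %1; compares the value in memory at *lock with the RAX register (old has already been loaded into RAX through the inline-assembly constraint "a" (old)). If they are equal, it stores the new value set into *lock and sets ZF to 1; otherwise it clears ZF and loads the current value of *lock into RAX.
sete %0; sets %0 according to the ZF flag. Here %0 is AL, the low byte of RAX, because res is a u_char bound by the "=a" constraint; it becomes the function's return value res.
cc indicates that the flags register is modified; memory indicates that memory is modified.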

The inline code above implements CAS mainly through the cmpxchgq assembly instruction.
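To show how a cmp_set-style primitive is typically used, here is a rough sketch of a try-lock built on top of it. It uses C11 atomics instead of Nginx's inline assembly so that it is not tied to x86-64, and the names cmp_set, try_lock, and unlock are hypothetical stand-ins chosen for this example, not Nginx's API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical portable stand-in for a cmp_set-style primitive:
   returns true if *lock held `old` and has been replaced by `set`. */
static bool cmp_set(atomic_ulong *lock, unsigned long old, unsigned long set)
{
    assert(lock != NULL);
    /* `old` is a local copy, so the write-back on failure is harmless. */
    return atomic_compare_exchange_strong(lock, &old, set);
}

/* Take the lock only if it is free (0); record a non-zero owner id. */
static bool try_lock(atomic_ulong *lock, unsigned long owner)
{
    return cmp_set(lock, 0, owner);
}

/* Release the lock only if `owner` is the one currently holding it. */
static void unlock(atomic_ulong *lock, unsigned long owner)
{
    cmp_set(lock, owner, 0);
}
```

A caller that fails try_lock can retry, back off, or fall back to a blocking mechanism; that policy is left out of the sketch.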

X86: CMPXCHG

Compare and Exchange

The instructions below are shown in Intel syntax:

Opcode	Mnemonic	Description
`0F B0 /r`	`CMPXCHG r/m8,r8`	Compare AL with r/m8. If equal, ZF is set and r8 is loaded into r/m8. Else, clear ZF and load r/m8 into AL.
`0F B1 /r`	`CMPXCHG r/m16,r16`	Compare AX with r/m16. If equal, ZF is set and r16 is loaded into r/m16. Else, clear ZF and load r/m16 into AX
`0F B1 /r`	`CMPXCHG r/m32,r32`	Compare EAX with r/m32. If equal, ZF is set and r32 is loaded into r/m32. Else, clear ZF and load r/m32 into EAX

Description

Compares the value in the AL, AX, or EAX register (depending on the size of the operand) with the destination operand. If the two values are equal, the source operand is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, or EAX register.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

To simplify the interface to the processor's bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

Flags affected
The ZF flag is set if the values in the destination operand and register AL, AX, or EAX are equal; otherwise it is cleared. The CF, PF, AF, SF, and OF flags are set according to the results of the comparison operation.
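The behavior described above can be restated as a small software model. The function below is deliberately non-atomic and exists only to make the data flow explicit; the name cmpxchg32_model and its parameter names are assumptions made for this illustration (acc stands for EAX, dest for r/m32, src for r32, and the return value for ZF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Non-atomic model of 32-bit CMPXCHG, for illustration only.
   Returns the resulting ZF flag. */
static bool cmpxchg32_model(uint32_t *dest, uint32_t *acc, uint32_t src)
{
    assert(dest != NULL && acc != NULL);

    if (*acc == *dest) {
        *dest = src;     /* equal: source is loaded into destination */
        return true;     /* ZF is set */
    }

    *acc = *dest;        /* not equal: destination is loaded into EAX */
    return false;        /* ZF is cleared */
}
```

Because the accumulator receives the current destination value on failure, a caller can retry immediately without issuing a separate load.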

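In AT&T syntax, the operands of the instructions in the table above are swapped, for example:

0F B1 /r CMPXCHG r32, r/m32 Compare EAX with r/m32. If equal, ZF is set and r32 is loaded into r/m32. Else, clear ZF and load r/m32 into EAX

That is, in AT&T syntax the first operand of CMPXCHG, r32, is the source operand and must be a register, while the second operand, r/m32, is the destination operand and may be a register or a memory location.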

X86_64: CMPXCHGQ

The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.

There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.

But, as stated, cmpxchgq is probably the better option for 64-bit code.
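In C, a 64-bit compare-and-swap is usually requested through the compiler rather than by writing cmpxchgq by hand. The sketch below uses the GCC/Clang __atomic_compare_exchange_n builtin, a non-standard extension; the wrapper name cas_u64 is made up for this example. On x86-64 these compilers typically emit a lock cmpxchg on a 64-bit operand for it, though the exact instruction selection is up to the compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper around the GCC/Clang builtin.
   If *loc == *expected, store desired into *loc and return true.
   Otherwise copy the current *loc into *expected and return false,
   much like CMPXCHG loading the destination into RAX on failure. */
static bool cas_u64(uint64_t *loc, uint64_t *expected, uint64_t desired)
{
    assert(loc != NULL && expected != NULL);
    return __atomic_compare_exchange_n(loc, expected, desired,
                                       false, /* strong: no spurious failure */
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```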

============================= Pthread Futex ===============================

futex - fast user-space locking

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.
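glibc does not export a futex() wrapper, so programs usually reach the system call through syscall(2). The following Linux-only sketch shows the two basic operations described above; the wrapper names futex_wait and futex_wake are assumptions for this example, and timeouts and the FUTEX_PRIVATE_FLAG optimization are left out for brevity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block while *addr still equals `expected`. If the values already
   differ, the kernel returns immediately with -1 and errno == EAGAIN. */
static long futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to `count` threads blocked on addr.
   Returns the number of waiters that were woken. */
static long futex_wake(uint32_t *addr, int count)
{
    assert(count >= 0);
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

The value check inside FUTEX_WAIT is what closes the race between a thread deciding to sleep and another thread changing the word and issuing a wake.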

Most programmers will in fact not be using futexes directly but instead rely on system libraries built on them, such as the NPTL pthreads implementation.

Linux Implementations of POSIX Threads
Over time, two threading implementations have been provided by the GNU C library on Linux:

1) LinuxThreads
This is the original Pthreads implementation. Since glibc 2.4, this implementation is no longer supported.

2) NPTL (Native POSIX Threads Library)
This is the modern Pthreads implementation. By comparison with LinuxThreads, NPTL provides closer conformance to the requirements of the POSIX.1 specification and better performance when creating large numbers of threads. NPTL is available since glibc 2.3.2, and requires features that are present in the Linux 2.6 kernel.

Both of these are so-called 1:1 implementations, meaning that each thread maps to a kernel scheduling entity. In NPTL, thread synchronization primitives (mutexes, thread joining, etc.) are implemented using the Linux futex system call.

NPTL is now inside GNU Libc on Linux. An application compiled with gcc -pthread and linked with -pthread uses NPTL code on Linux today.
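As a reminder of what such an application looks like, here is a small sketch in which two threads increment a shared counter under a pthread mutex. The names worker, run_counter_demo, and the iteration counts are illustrative assumptions; build it with something like gcc -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define THREADS    2
#define ITERATIONS 100000

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Each worker increments the shared counter under the mutex. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs THREADS workers to completion and returns the final count. */
static long run_counter_demo(void)
{
    pthread_t tids[THREADS];

    counter = 0;
    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(tids[i], NULL);

    return counter;
}
```

Without the mutex the increments would race and the final count would usually fall short of THREADS * ITERATIONS.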

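When there is no contention between threads, a futex lets the calling thread acquire the lock directly in user space with a CAS, without entering kernel space, which avoids a context switch.

Only when threads contend for the lock does the calling thread enter kernel space, where it blocks and waits.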
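The fast path and slow path described above can be sketched as a toy mutex. This is a simplified, Linux-only illustration in the style of the well-known three-state futex lock (0 = unlocked, 1 = locked, 2 = locked with possible waiters); it is not glibc's implementation. The names fmutex_lock and fmutex_unlock are hypothetical, and the code assumes atomic_uint has the same size and representation as the 32-bit futex word.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void fmutex_lock(atomic_uint *m)
{
    unsigned c = 0;

    /* Fast path: uncontended, a single user-space CAS 0 -> 1. */
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;

    /* Slow path: mark the lock as contended (2), then sleep in the
       kernel until the holder releases it and wakes us. */
    if (c != 2)
        c = atomic_exchange(m, 2);
    while (c != 0) {
        /* Sleeps only if *m is still 2 at the time of the call. */
        syscall(SYS_futex, m, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(m, 2);
    }
}

static void fmutex_unlock(atomic_uint *m)
{
    unsigned prev = atomic_exchange(m, 0);

    assert(prev != 0);   /* unlocking an unlocked mutex is a caller bug */

    /* Enter the kernel only if someone may be waiting. */
    if (prev == 2)
        syscall(SYS_futex, m, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

In the uncontended case neither function makes a system call: lock is one CAS and unlock is one exchange.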

A fragment of the mutex lock source code from the Pthread implementation, pthread_mutex_lock.c:

/* The PI support requires the Linux futex system call.  If that's not
	available, pthread_mutex_init should never have allowed the type to
	be set.  So it will get the default case for an invalid type.  */
#ifdef __NR_futex
......

static int
__pthread_mutex_lock_full (pthread_mutex_t *mutex)
	......
	
	newval
	    = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
						       newval, oldval);

References:

https://mudongliang.github.io/x86/html/file_module_x86_id_41.html

https://stackoverflow.com/questions/833122/cmpxchg-example-for-64-bit-integer

http://man7.org/linux/man-pages/man2/futex.2.html

https://github.molgen.mpg.de/git-mirror/glibc/blob/master/nptl/pthread_mutex_lock.c

https://www.unix.com/man-page/linux/7/pthreads/