CAS Atomic Operations and Pthread Futex

Wikipedia's definition of CAS:

In computer science, compare-and-swap (CAS) is an atomic instruction used in multithreading to achieve synchronization. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation. The atomicity guarantees that the new value is calculated based on up-to-date information; if the value had been updated by another thread in the meantime, the write would fail. The result of the operation must indicate whether it performed the substitution; this can be done either with a simple boolean response (this variant is often called compare-and-set), or by returning the value read from the memory location (not the value written to it).

Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. Then they try to swap in the new value using CAS, where the comparison checks for the location still being equal to the old value. If CAS indicates that the attempt has failed, it has to be repeated from the beginning: the location is re-read, a new value is re-computed and the CAS is tried again.

https://en.wikipedia.org/wiki/Compare-and-swap

Many C compilers support using compare-and-swap either with the C11 <stdatomic.h> functions, or some non-standard C extension of that particular C compiler, or by calling a function written directly in assembly language using the compare-and-swap instruction.
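The read, compute, retry pattern described above can be sketched with the C11 <stdatomic.h> functions. This is only an illustration: the helper name sketch_add_capped and its capping rule are made up for this example and do not come from any library.

```c
#include <stdatomic.h>

/* Hypothetical helper: atomically add delta to *value, but never let
 * the stored result exceed cap. Returns the value that was stored. */
static int sketch_add_capped(atomic_int *value, int delta, int cap)
{
    int old = atomic_load(value);   /* read and remember the old value */
    int desired;

    do {
        /* compute the new value from the remembered old value */
        desired = old + delta;
        if (desired > cap)
            desired = cap;
        /* On failure, atomic_compare_exchange_weak refreshes old with
         * the current contents, so the loop recomputes and retries. */
    } while (!atomic_compare_exchange_weak(value, &old, desired));

    return desired;
}
```

The weak variant may fail spuriously on some architectures, which is acceptable here because the loop retries anyway.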

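C11 provides functions for atomic CAS operations; alternatively, CAS can be implemented with inline assembly:

Taking the Nginx source code as an example, its C inline-assembly implementation is as follows: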

static ngx_inline ngx_atomic_uint_t
ngx_atomic_cmp_set(ngx_atomic_t *lock, ngx_atomic_uint_t old,
    ngx_atomic_uint_t set)
{
    u_char res;

    __asm__ volatile (

        NGX_SMP_LOCK
        " cmpxchgq %3, %1; "
        " sete %0; "

        : "=a" (res) : "m" (*lock), "a" (old), "r" (set) : "cc", "memory");

    return res;
}

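On a single-core CPU, the cmpxchgq instruction is atomic by itself, because a single instruction cannot be interrupted halfway through. NGX_SMP_LOCK is a macro for the CPU lock instruction prefix, which keeps the operation atomic across cores on multi-core systems.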

#if (NGX_SMP)                                 
#define NGX_SMP_LOCK  "lock;"                 
#else                                          
#define NGX_SMP_LOCK                           
#endif

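The lock prefix makes the processor assert its LOCK# signal, which locks the memory bus and prevents other CPUs from accessing memory over the bus until the prefixed instruction has finished. On modern processors, when the target memory is cacheable, the same guarantee is usually provided by locking the cache line through the cache-coherency protocol rather than by locking the whole bus.
cmpxchgq %3, %1; compares the value stored at *lock with the RAX register (old has already been loaded into RAX by the inline-assembly constraint "a" (old)). If they are equal, *lock is set to the new value set and ZF is set to 1; otherwise ZF is cleared and the current value of *lock is loaded into RAX.
sete %0; sets %0 according to the ZF flag. Here %0 is AL, the low byte of RAX, because res is a u_char bound with "=a"; it becomes the function's return value res.
cc tells the compiler that the flags register is modified; memory tells it that memory is modified.

The inline code above implements CAS mainly through the cmpxchgq assembly instruction.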
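For comparison, the same compare-and-set contract (return 1 if the swap happened, 0 otherwise) can be sketched without inline assembly by using the __sync_bool_compare_and_swap builtin that GCC and Clang provide. The name sketch_cmp_set is hypothetical, and this sketch assumes one of those compilers.

```c
#include <stdint.h>

/* Hypothetical stand-in with the same contract as the inline-assembly
 * version: if *lock == old, store set into *lock and return 1;
 * otherwise leave *lock unchanged and return 0.
 * Relies on a GCC/Clang builtin, so it is not portable ISO C. */
static inline int sketch_cmp_set(volatile uint64_t *lock,
                                 uint64_t old, uint64_t set)
{
    return __sync_bool_compare_and_swap(lock, old, set) ? 1 : 0;
}
```

A simple try-lock can then be expressed as sketch_cmp_set(&lock, 0, 1): it succeeds only for the caller that observes the lock word as 0.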

X86: CMPXCHG

Compare and Exchange

The instructions below are shown in Intel syntax:

Opcode	Mnemonic	Description
`0F B0 /r`	`CMPXCHG r/m8,r8`	Compare AL with r/m8. If equal, ZF is set and r8 is loaded into r/m8. Else, clear ZF and load r/m8 into AL.
`0F B1 /r`	`CMPXCHG r/m16,r16`	Compare AX with r/m16. If equal, ZF is set and r16 is loaded into r/m16. Else, clear ZF and load r/m16 into AX
`0F B1 /r`	`CMPXCHG r/m32,r32`	Compare EAX with r/m32. If equal, ZF is set and r32 is loaded into r/m32. Else, clear ZF and load r/m32 into EAX

Description

Compares the value in the AL, AX, or EAX register (depending on the size of the operand) with the destination operand. If the two values are equal, the source operand is loaded into the destination operand. Otherwise, the destination operand is loaded into the AL, AX, or EAX register.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically.

To simplify the interface to the processor's bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

Flags affected
The ZF flag is set if the values in the destination operand and register AL, AX, or EAX are equal; otherwise it is cleared. The CF, PF, AF, SF, and OF flags are set according to the results of the comparison operation.
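The behaviour described above can be modelled in plain C. The function below is a deliberately non-atomic model of the 32-bit form, written only to make the data flow explicit; the name cmpxchg32_model and the idea of returning ZF as a bool are assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* NOT atomic: a software model of CMPXCHG r/m32, r32.
 *   dest : the r/m32 destination operand
 *   eax  : the accumulator register
 *   src  : the r32 source operand
 * The return value models ZF. */
static bool cmpxchg32_model(uint32_t *dest, uint32_t *eax, uint32_t src)
{
    if (*dest == *eax) {
        *dest = src;    /* equal: source is loaded into destination */
        return true;    /* ZF = 1 */
    }
    *eax = *dest;       /* not equal: destination is loaded into EAX */
    return false;       /* ZF = 0 */
}
```

The failure branch is what lets a retry loop continue without a separate reload: the accumulator already holds the value that was actually found in memory.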

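In AT&T syntax, the operands of the instructions in the table above are swapped, for example:

0F B1 /r CMPXCHG r32, r/m32 Compare EAX with r/m32. If equal, ZF is set and r32 is loaded into r/m32. Else, clear ZF and load r/m32 into EAX

That is, in AT&T syntax the first operand of CMPXCHG, r32, is the source operand and must be a register, while the second operand, r/m32, is the destination operand and can be either a register or a memory location.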

X86_64: CMPXCHGQ

The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.

There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.

But, as stated, cmpxchgq is probably the better option for 64-bit code.
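From C, a 64-bit compare-and-swap does not require writing the instruction by hand. The sketch below uses C11 atomics on a 64-bit value; on x86_64, GCC and Clang are generally expected to compile the compare-exchange to a lock-prefixed 64-bit cmpxchg, though that is a property of the compiler and target rather than a guarantee of the C standard. The helper name sketch_atomic_max_u64 is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: raise *target to candidate if candidate is
 * larger. Returns the maximum that is in effect afterwards. */
static uint64_t sketch_atomic_max_u64(_Atomic uint64_t *target,
                                      uint64_t candidate)
{
    uint64_t seen = atomic_load(target);

    /* Try to install candidate while it is still larger than the
     * value last seen. A failed CAS refreshes seen. */
    while (seen < candidate &&
           !atomic_compare_exchange_weak(target, &seen, candidate)) {
        /* retry with the refreshed value */
    }

    return seen < candidate ? candidate : seen;
}
```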

============================= Pthread Futex ===============================

futex - fast user-space locking

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

Most programmers will in fact not be using futexes directly but instead rely on system libraries built on them, such as the NPTL pthreads implementation.
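For illustration only, the raw system call can be reached through syscall(), since glibc does not export a futex() wrapper. The sketch below is Linux-specific; the wrapper names sketch_futex_wait and sketch_futex_wake are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Block while *addr still equals expected. If *addr differs when the
 * kernel checks it, the call fails at once with errno EAGAIN. */
static long sketch_futex_wait(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake at most count waiters blocked on addr.
 * Returns the number of waiters that were woken. */
static long sketch_futex_wake(uint32_t *addr, int count)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, count, NULL, NULL, 0);
}
```

The value check inside FUTEX_WAIT is what closes the race between deciding to sleep in user space and actually sleeping in the kernel: if another thread changed the word in between, the caller returns immediately instead of blocking.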

Linux Implementations of POSIX Threads
Over time, two threading implementations have been provided by the GNU C library on Linux:

1) LinuxThreads
This is the original Pthreads implementation. Since glibc 2.4, this implementation is no longer supported.

2) NPTL (Native POSIX Threads Library)
This is the modern Pthreads implementation. By comparison with LinuxThreads, NPTL provides closer conformance to the requirements of the POSIX.1 specification and better performance when creating large numbers of threads. NPTL is available since glibc 2.3.2, and requires features that are present in the Linux 2.6 kernel.

Both of these are so-called 1:1 implementations, meaning that each thread maps to a kernel scheduling entity. In NPTL, thread synchronization primitives (mutexes, thread joining, etc.) are implemented using the Linux futex system call.

NPTL is now inside GNU Libc on Linux. An application compiled with gcc -pthread and linked with -pthread uses NPTL code on Linux today.

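When there is no contention between threads, a futex-based lock lets the calling thread acquire the lock directly in user space with a CAS, without entering kernel space, which avoids a context switch.

Only when threads contend for the lock does the caller enter kernel space, where it is put into a blocked, waiting state.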
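The two paths can be sketched as a small futex-based mutex. This follows a commonly described three-state design and is not glibc's actual implementation; all sketch_ names are hypothetical, and the code assumes Linux and that atomic_int has the same size and layout as the 32-bit word the kernel expects.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* state: 0 = unlocked, 1 = locked without waiters,
 *        2 = locked and waiters may exist */
typedef struct { atomic_int state; } sketch_mutex;

static long sketch_futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_mutex_lock(sketch_mutex *m)
{
    int c = 0;

    /* Fast path: one CAS from 0 to 1, entirely in user space. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;

    /* Slow path: mark the lock as contended, then sleep in the
     * kernel for as long as the state remains 2. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        sketch_futex(&m->state, FUTEX_WAIT_PRIVATE, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void sketch_mutex_unlock(sketch_mutex *m)
{
    /* Old value 1 means nobody waited: no system call is needed. */
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        atomic_store(&m->state, 0);
        sketch_futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
    }
}
```

In the uncontended case both lock and unlock complete with a single atomic instruction each; the futex system call is issued only after a thread has found the lock already held.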

A fragment of the mutex lock source code inside the Pthread implementation, from pthread_mutex_lock.c:

/* The PI support requires the Linux futex system call.  If that's not
	available, pthread_mutex_init should never have allowed the type to
	be set.  So it will get the default case for an invalid type.  */
#ifdef __NR_futex
......

static int
__pthread_mutex_lock_full (pthread_mutex_t *mutex)
	......
	
	newval
	    = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
						       newval, oldval);

References:

https://mudongliang.github.io/x86/html/file_module_x86_id_41.html

https://stackoverflow.com/questions/833122/cmpxchg-example-for-64-bit-integer

http://man7.org/linux/man-pages/man2/futex.2.html

https://github.molgen.mpg.de/git-mirror/glibc/blob/master/nptl/pthread_mutex_lock.c

https://www.unix.com/man-page/linux/7/pthreads/