mutex spinlock

from: http://www.parallellabs.com/2010/01/31/pthreads-programming-spin-lock-vs-mutex-performance-analysis/

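POSIX threads (Pthreads for short) is a widely used API for parallel programming on multi-core platforms. Thread synchronization is an essential means of coordination in parallel programs. Its most typical use is protecting a critical section shared by several threads with one of the locks that Pthreads provides (another common synchronization mechanism is the barrier).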

Pthreads provides several locking mechanisms:
(1) Mutex: pthread_mutex_***
(2) Spin lock: pthread_spin_***
(3) Condition variable: pthread_cond_***
(4) Read/write lock: pthread_rwlock_***

The main Pthreads APIs for mutex lock operations are:
int pthread_mutex_lock (pthread_mutex_t *mutex);
int pthread_mutex_trylock (pthread_mutex_t *mutex);
int pthread_mutex_unlock (pthread_mutex_t *mutex);
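
As a minimal sketch of how these three calls fit together, the fragment below protects a shared counter. The counter and the function names add_one() and try_add_one() are hypothetical and do not come from the original post.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state, guarded by a statically initialized mutex. */
static pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Blocking variant: the caller sleeps until the mutex becomes available. */
long add_one(void)
{
    int rc = pthread_mutex_lock(&counter_mutex);
    assert(rc == 0);
    (void)rc;
    long value = ++counter;               /* critical section */
    pthread_mutex_unlock(&counter_mutex);
    return value;
}

/* Non-blocking variant: returns -1 at once if the mutex is already held. */
long try_add_one(void)
{
    int rc = pthread_mutex_trylock(&counter_mutex);
    if (rc == EBUSY)
        return -1;
    assert(rc == 0);
    long value = ++counter;               /* critical section */
    pthread_mutex_unlock(&counter_mutex);
    return value;
}
```

The trylock variant is useful when the caller has other work it can do instead of waiting.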

The main Pthreads APIs for spin lock operations are:
int pthread_spin_lock (pthread_spinlock_t *lock);
int pthread_spin_trylock (pthread_spinlock_t *lock);
int pthread_spin_unlock (pthread_spinlock_t *lock);
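
The spin lock calls are used the same way, except that a spin lock has no static initializer and must be set up with pthread_spin_init() first. The sketch below assumes a hypothetical shared counter; all spin_* function names are illustrative only.

```c
#define _POSIX_C_SOURCE 200809L   /* make the pthread_spin_* declarations visible */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state guarded by a spin lock. */
static pthread_spinlock_t spin_counter_lock;
static long spin_counter = 0;

/* Must run once before any other function here is called. */
void spin_counter_init(void)
{
    int rc = pthread_spin_init(&spin_counter_lock, PTHREAD_PROCESS_PRIVATE);
    assert(rc == 0);
    (void)rc;
    spin_counter = 0;
}

/* Busy-waits until the lock is free, then increments the counter. */
long spin_add_one(void)
{
    pthread_spin_lock(&spin_counter_lock);
    long value = ++spin_counter;          /* keep this section very short */
    pthread_spin_unlock(&spin_counter_lock);
    return value;
}

/* Makes a single attempt; returns -1 if the lock is currently held. */
long spin_try_add_one(void)
{
    if (pthread_spin_trylock(&spin_counter_lock) == EBUSY)
        return -1;
    long value = ++spin_counter;
    pthread_spin_unlock(&spin_counter_lock);
    return value;
}

void spin_counter_destroy(void)
{
    pthread_spin_destroy(&spin_counter_lock);
}
```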

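In terms of implementation, a mutex is a sleep-waiting lock. Suppose a dual-core machine runs two threads, A and B, on Core0 and Core1 respectively. If thread A calls pthread_mutex_lock to acquire the lock on a critical section while thread B holds it, thread A blocks: Core0 performs a context switch and puts thread A on a wait queue, so Core0 is free to run other work (say, another thread C) instead of busy-waiting. A spin lock is different: it is a busy-waiting lock. If thread A requests the lock with pthread_spin_lock, it keeps busy-waiting on Core0, retrying the lock request over and over until it gets the lock.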

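If you read the source of NPTL (Native POSIX Thread Library), the pthreads implementation in Linux glibc (the command "getconf GNU_LIBPTHREAD_VERSION" prints the NPTL version on your system), you will find that when pthread_mutex_lock() fails to take the lock it enters the kernel to wait (referred to here as system_wait()) and adds the current thread to that mutex's wait queue. (Current NPTL is built on the futex, which handles much of the work in user space, so system calls are far less frequent and performance has improved greatly.) A spin lock, by contrast, can be thought of as a lock operation implemented with inline assembly inside a while(1) loop. (I recall a paper saying that in the Linux kernel the spin lock operation needs only two CPU instructions and the unlock only one.) Interested readers can also look at the pthreads API implementation in a microkernel called sanos: mutex.c and spinlock.c. Although it differs from the NPTL code, it is very simple and easy to follow, which makes it helpful for understanding the characteristics of spin locks and mutexes.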
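
To make the "lock operation inside a while(1) loop" picture concrete, here is a toy test-and-set spin lock. It is a sketch only: it uses C11 atomics rather than inline assembly, it is not how NPTL, sanos, or the Linux kernel implement their locks, and the toy_* names are made up for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock: the flag is set while some thread holds the lock. */
typedef struct {
    atomic_flag locked;
} toy_spinlock;

void toy_spin_init(toy_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->locked);
}

/* One attempt: true if we took the lock, false if it was already held. */
bool toy_spin_trylock(toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait: retry the atomic test-and-set until it finds the flag clear. */
void toy_spin_lock(toy_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;   /* spin: the waiting thread keeps its core busy the whole time */
}

/* Unlock is a single atomic store that clears the flag. */
void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Note that the waiting thread never gives up its core here, which is exactly the busy-waiting behavior described above.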

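So which performs better in practice, a mutex or a spin lock? Spin locks are used very widely in the Linux kernel; does that mean spin locks are faster? Let's test with real code (make sure a recent g++ is installed on your system).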


// Name: spinlockvsmutex1.cc
// Source: http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock
// Compiler(spin lock version): g++ -o spin_version -DUSE_SPINLOCK spinlockvsmutex1.cc -lpthread
// Compiler(mutex version): g++ -o mutex_version spinlockvsmutex1.cc -lpthread

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <list>
#include <pthread.h>

#define LOOPS 50000000

// std::list is spelled out (no "using namespace std") so that the global
// named mutex below cannot collide with std::mutex.
std::list<int> the_list;

#ifdef USE_SPINLOCK
pthread_spinlock_t spinlock;
#else
pthread_mutex_t mutex;
#endif

// Get the thread id. Named get_tid() because recent glibc versions
// declare their own gettid() in <unistd.h>.
pid_t get_tid() { return syscall( __NR_gettid ); }

// Pop entries off the shared list until it is empty.
void *consumer(void *ptr)
{
    int i;

    printf("Consumer TID %lu\n", (unsigned long)get_tid());

    while (1)
    {
#ifdef USE_SPINLOCK
        pthread_spin_lock(&spinlock);
#else
        pthread_mutex_lock(&mutex);
#endif

        if (the_list.empty())
        {
            // Release the lock before leaving the loop.
#ifdef USE_SPINLOCK
            pthread_spin_unlock(&spinlock);
#else
            pthread_mutex_unlock(&mutex);
#endif
            break;
        }

        i = the_list.front();
        the_list.pop_front();

#ifdef USE_SPINLOCK
        pthread_spin_unlock(&spinlock);
#else
        pthread_mutex_unlock(&mutex);
#endif
    }

    return NULL;
}

int main()
{
    int i;
    pthread_t thr1, thr2;
    struct timeval tv1, tv2;

#ifdef USE_SPINLOCK
    pthread_spin_init(&spinlock, 0);
#else
    pthread_mutex_init(&mutex, NULL);
#endif

    // Creating the list content...
    for (i = 0; i < LOOPS; i++)
        the_list.push_back(i);

    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);

    pthread_create(&thr1, NULL, consumer, NULL);
    pthread_create(&thr2, NULL, consumer, NULL);

    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    // Borrow one second when the microsecond field would go negative.
    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    // %06ld zero-pads the microseconds so the fraction prints correctly.
    printf("Result - %ld.%06ld\n", (long)(tv2.tv_sec - tv1.tv_sec),
        (long)(tv2.tv_usec - tv1.tv_usec));

#ifdef USE_SPINLOCK
    pthread_spin_destroy(&spinlock);
#else
    pthread_mutex_destroy(&mutex);
#endif

    return 0;
}

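The program runs as follows. The main thread first initializes a list and inserts LOOPS entries into it, then creates two new threads that both execute consumer(). The two new threads pop entries off the list concurrently. The main thread measures the time from creating the two threads until both have finished and prints it as the "Result" shown below.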
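
The elapsed-time arithmetic can also be written as one subtraction in microseconds, which avoids the manual borrow between the seconds and microseconds fields. This is only a sketch, and the helper name elapsed_usec() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Returns end - start in microseconds. The microsecond difference may be
 * negative on its own; adding it to the scaled seconds corrects for that. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}
```

For example, a start of 1.900000 s and an end of 3.100000 s give 1200000 microseconds.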

Test machine:
Ubuntu 9.04 X86_64
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
4.0 GB Memory

The test results are as follows:


gchen@gchen-desktop:~/Workspace/mutex$ g++ -o spin_version -DUSE_SPINLOCK spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ g++ -o mutex_version spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ time ./spin_version
Consumer TID 5520
Consumer TID 5521
Result - 5.888750

real    0m10.918s
user    0m15.601s
sys     0m0.804s

gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex_version
Consumer TID 5691
Consumer TID 5692
Result - 9.116376

real    0m14.031s
user    0m12.245s
sys     0m4.368s

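As you can see, the spin lock version performs better in this program. Also note the sys time: the mutex version spends more time in system calls, because a mutex enters the kernel to wait whenever the lock is contended.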
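
The user/sys split that time reports can also be read from inside a program with getrusage(). The helper below is a hedged sketch (the name cpu_seconds() is hypothetical and is not part of the benchmark above); calling it before and after a region of interest gives the CPU time that region consumed in user mode and in the kernel.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Stores the CPU time consumed so far by the calling process:
 * *user_s is time spent in user mode, *sys_s is time spent in the kernel. */
void cpu_seconds(double *user_s, double *sys_s)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    *user_s = (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
    *sys_s  = (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
}
```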

But does that mean a spin lock is always better? Let's look at another example program, one with very heavy lock contention:


//Name: svm2.c
//Source: http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Locks
//Compile(spin lock version): gcc -o spin -DUSE_SPINLOCK svm2.c -lpthread
//Compile(mutex version): gcc -o mutex svm2.c -lpthread

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/syscall.h>

#define THREAD_NUM 2

pthread_t g_thread[THREAD_NUM];

#ifdef USE_SPINLOCK
pthread_spinlock_t g_spin;
#else
pthread_mutex_t g_mutex;
#endif

uint64_t g_count;

/* Named get_tid() because recent glibc versions declare their own gettid(). */
pid_t get_tid()
{
    return syscall(SYS_gettid);
}

/* Each thread takes the lock 10000 times and holds it for a long inner loop. */
void *run_amuck(void *arg)
{
    int i, j;

    printf("Thread %lu started.\n", (unsigned long)get_tid());

    for (i = 0; i < 10000; i++) {
#ifdef USE_SPINLOCK
        pthread_spin_lock(&g_spin);
#else
        pthread_mutex_lock(&g_mutex);
#endif
        /* Large critical section: 100000 increments per lock acquisition. */
        for (j = 0; j < 100000; j++) {
            if (g_count++ == 123456789)
                printf("Thread %lu wins!\n", (unsigned long)get_tid());
        }
#ifdef USE_SPINLOCK
        pthread_spin_unlock(&g_spin);
#else
        pthread_mutex_unlock(&g_mutex);
#endif
    }

    printf("Thread %lu finished!\n", (unsigned long)get_tid());

    return (NULL);
}

int main(int argc, char *argv[])
{
    int i, threads = THREAD_NUM;

    printf("Creating %d threads...\n", threads);

#ifdef USE_SPINLOCK
    pthread_spin_init(&g_spin, 0);
#else
    pthread_mutex_init(&g_mutex, NULL);
#endif

    /* The index goes through intptr_t to avoid an int-to-pointer size warning. */
    for (i = 0; i < threads; i++)
        pthread_create(&g_thread[i], NULL, run_amuck, (void *)(intptr_t) i);

    for (i = 0; i < threads; i++)
        pthread_join(g_thread[i], NULL);

    printf("Done.\n");

    return (0);
}

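The defining feature of this program is its very large critical section, so the two threads contend fiercely for the lock. This is of course an extreme case; in real applications critical sections are not this large and contention is not this intense. The results show the mutex version performing better: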


gchen@gchen-desktop:~/Workspace/mutex$ time ./spin
Creating 2 threads...
Thread 31796 started.
Thread 31797 started.
Thread 31797 wins!
Thread 31797 finished!
Thread 31796 finished!
Done.

real    0m5.748s
user    0m10.257s
sys     0m0.004s

gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex
Creating 2 threads...
Thread 31801 started.
Thread 31802 started.
Thread 31802 wins!
Thread 31802 finished!
Thread 31801 finished!
Done.

real    0m4.823s
user    0m4.772s
sys     0m0.032s

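Another detail worth noting is that the spin lock version burns much more user time. The two threads run on separate cores and most of the time only one of them can hold the lock, so the other keeps busy-waiting on its core at 100% CPU usage. With a mutex, a failed lock request triggers a context switch, which frees a core for other work. (That context switch also costs the thread holding the lock something: when it releases the lock it has to notify the operating system to wake the blocked threads, which is extra overhead.)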

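Summary
(1) A mutex suits scenarios where lock operations are very frequent, and it adapts better. Although it costs more than a spin lock (mainly in context switches), it copes with the complex scenarios found in real-world development and offers more flexibility while still delivering reasonable performance.

(2) A spin lock has faster lock/unlock operations (fewer CPU instructions), but it is only suitable when the critical section runs for a very short time. In real software development, unless the programmer understands the program's locking behavior very well, using a spin lock is not a good idea (a multithreaded program typically performs tens of thousands of lock operations, and if too many of them are contended lock requests, a lot of time is wasted spinning).

(3) A safer approach is probably to start (conservatively) with a mutex and then, if more performance is needed, try tuning with spin locks. After all, our programs are not as performance-critical as the Linux kernel (whose most commonly used locks are spin locks and rw locks).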
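
One middle ground between the two approaches, sketched here under stated assumptions and not taken from the original post, is to spin briefly with pthread_mutex_trylock() and fall back to a blocking pthread_mutex_lock() if the lock stays busy. SPIN_TRIES and spin_then_block_lock() are hypothetical names, and the value 100 is an arbitrary placeholder that would need measuring on a real workload.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define SPIN_TRIES 100   /* hypothetical tuning knob, not from the post */

/* Tries the mutex up to SPIN_TRIES times without sleeping, then blocks.
 * On success returns the number of failed attempts made before the lock
 * was taken (0..SPIN_TRIES); on error returns the negated error code. */
int spin_then_block_lock(pthread_mutex_t *m)
{
    int rc;

    assert(m != NULL);
    for (int tries = 0; tries < SPIN_TRIES; tries++) {
        rc = pthread_mutex_trylock(m);
        if (rc == 0)
            return tries;            /* acquired without sleeping */
        if (rc != EBUSY)
            return -rc;              /* a real error, not contention */
    }
    rc = pthread_mutex_lock(m);      /* give up spinning and sleep */
    return rc == 0 ? SPIN_TRIES : -rc;
}
```

The lock is released with the ordinary pthread_mutex_unlock(). glibc also offers a non-portable mutex type, PTHREAD_MUTEX_ADAPTIVE_NP, that is intended to spin briefly before sleeping; whether either variant helps depends on the workload.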

Addendum, March 3, 2010: this view is supported by Oracle's documentation:

During configuration, Berkeley DB selects a mutex implementation for the architecture. Berkeley DB normally prefers blocking-mutex implementations over non-blocking ones. For example, Berkeley DB will select POSIX pthread mutex interfaces rather than assembly-code test-and-set spin mutexes because pthread mutexes are usually more efficient and less likely to waste CPU cycles spinning without getting any work accomplished.

P.S. Both syscall(SYS_gettid) and syscall( __NR_gettid ) return the ID of the current thread :)
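
A small sketch of the two spellings side by side, assuming Linux with glibc; the wrapper names tid_via_sys() and tid_via_nr() are made up for this example.

```c
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thread ID obtained through the SYS_gettid constant. */
pid_t tid_via_sys(void)
{
    pid_t tid = (pid_t)syscall(SYS_gettid);
    assert(tid > 0);
    return tid;
}

/* Thread ID obtained through the __NR_gettid constant. */
pid_t tid_via_nr(void)
{
    pid_t tid = (pid_t)syscall(__NR_gettid);
    assert(tid > 0);
    return tid;
}
```

Called from the same thread, both wrappers return the same value.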
