Comparing atomic and mutex in C++

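I have recently been optimizing a program I wrote a while ago, changing its I/O layer from a single-threaded Reactor model to a multi-threaded Proactor model. Previously an asynchronous I/O event woke a thread, which then did the reads and writes itself. Now one thread performs the asynchronous I/O and hands the data to another thread for the business logic. That raises the question of how to pass data between threads, and since this is I/O data the amount is fairly large: a whole block of memory (a buffer) changes hands each time. None of this is a big deal. The designs are mature: either add a lock or use a lock-free ring buffer. But somewhere along the way I became obsessed with one question: which is more efficient, the lock or the lock-free ring buffer? I know business code should not agonize over this kind of technical detail, that the difference will be small, and that nobody on the business side will notice which one I pick. Still, I could not rest until I had written a test program.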
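To make the lock-free option concrete, here is a minimal sketch of a single-producer/single-consumer byte ring built on two atomic counters. It is not the code from my program: the class name SpscRing, the power-of-two capacity restriction and the byte-by-byte copy are all assumptions made to keep the example short. It is only safe when exactly one thread writes and exactly one thread reads.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical single-producer / single-consumer byte ring.
// Exactly one thread may call write() and exactly one thread may call read().
template <std::size_t Capacity>
class SpscRing
{
	static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
		"Capacity must be a power of two");
public:
	// Producer side: stores up to len bytes and returns how many were stored.
	std::size_t write(const char* data, std::size_t len) noexcept
	{
		assert(data != nullptr || len == 0);
		const std::size_t head = _head.load(std::memory_order_relaxed);
		// acquire: the consumer has finished reading everything before _tail
		const std::size_t tail = _tail.load(std::memory_order_acquire);
		const std::size_t free_bytes = Capacity - (head - tail);
		if (len > free_bytes) len = free_bytes;
		for (std::size_t i = 0; i < len; i++)
			_buf[(head + i) & (Capacity - 1)] = data[i];
		// release: publish the copied bytes to the consumer
		_head.store(head + len, std::memory_order_release);
		return len;
	}

	// Consumer side: copies up to len bytes out and returns how many were read.
	std::size_t read(char* out, std::size_t len) noexcept
	{
		assert(out != nullptr || len == 0);
		const std::size_t tail = _tail.load(std::memory_order_relaxed);
		// acquire: see the bytes the producer published before storing _head
		const std::size_t head = _head.load(std::memory_order_acquire);
		const std::size_t used = head - tail;
		if (len > used) len = used;
		for (std::size_t i = 0; i < len; i++)
			out[i] = _buf[(tail + i) & (Capacity - 1)];
		// release: hand the consumed space back to the producer
		_tail.store(tail + len, std::memory_order_release);
		return len;
	}

private:
	char _buf[Capacity];
	std::atomic<std::size_t> _head{0}; // total bytes ever written
	std::atomic<std::size_t> _tail{0}; // total bytes ever read
};
```

Each call touches both counters, so one write plus one read costs about four atomic operations in this sketch. A real implementation would probably also copy with memcpy in at most two chunks instead of byte by byte.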

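First, about my program. It is a typical case of one thread producing data and another consuming it, but the producing and consuming logic are independent of each other, and a lock is needed only at the instant the data changes hands. Actual contention is therefore unlikely, so what I care about most is how each approach performs when there is no contention. A lock-free ring buffer needs roughly 2 to 3 atomic variables, while the locking approach needs just one lock. Here is the code: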

#include <iostream>
#include <atomic>
#include <chrono>
#include <mutex>

#ifndef _MSC_VER
#include <pthread.h>
#endif

/// A spin lock implemented with std::atomic_flag
class SpinLock final
{
public:
	SpinLock()
	{
		_flag.clear();

		// C++20 This macro is no longer needed and deprecated, 
		// since default constructor of std::atomic_flag initializes it to clear state.
		// _flag = ATOMIC_FLAG_INIT;
	}
	~SpinLock() = default;
	SpinLock(const SpinLock&) = delete;
	SpinLock(SpinLock&&) = delete;

	void lock() noexcept
	{
		// https://en.cppreference.com/w/cpp/atomic/atomic_flag_test_and_set
		// Example A spinlock mutex can be implemented in userspace using an atomic_flag

		// Spin until test_and_set returns false, i.e. until this thread is the one that set the flag
		while (_flag.test_and_set(std::memory_order_acquire));
	}

	bool try_lock() noexcept
	{
		return !_flag.test_and_set(std::memory_order_acquire);
	}

	void unlock() noexcept
	{
		_flag.clear(std::memory_order_release);
	}
private:
	std::atomic_flag _flag;
};

#ifndef _MSC_VER
/// https://rigtorp.se/spinlock/
struct spinlock {
	std::atomic<bool> lock_ = { 0 };

	void lock() noexcept {
		for (;;) {
			// Optimistically assume the lock is free on the first try
			if (!lock_.exchange(true, std::memory_order_acquire)) {
				return;
			}
			// Wait for lock to be released without generating cache misses
			while (lock_.load(std::memory_order_relaxed)) {
				// Issue X86 PAUSE or ARM YIELD instruction to reduce contention between
				// hyper-threads
				__builtin_ia32_pause();
			}
		}
	}

	bool try_lock() noexcept {
		// First do a relaxed load to check if lock is free in order to prevent
		// unnecessary cache misses if someone does while(!try_lock())
		return !lock_.load(std::memory_order_relaxed) &&
			!lock_.exchange(true, std::memory_order_acquire);
	}

	void unlock() noexcept {
		lock_.store(false, std::memory_order_release);
	}
};
#endif

int main()
{
	const int ts = 10000000;

	int ii1 = 0;
	int ii2 = 0;
	int ii3 = 0;

	std::atomic<int> i1(0);
	std::atomic<int> i2(0);
	std::atomic<int> i3(0);

	std::chrono::steady_clock::time_point beg;
	std::chrono::steady_clock::time_point end;

	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run int      " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		i1 += i / 2;
		i2 += 1;
		i3 += i1;
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run atomic   " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
	std::mutex m;
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		m.lock();
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		m.unlock();
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run mutex    " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
	SpinLock l;
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		l.lock();
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		l.unlock();
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run SpinLock " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

#ifndef _MSC_VER
	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
	spinlock ll;
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		ll.lock();
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		ll.unlock();
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run spinlock " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;

	// using pthread_spin_lock
	// https://docs.oracle.com/cd/E26502_01/html/E35303/ggecq.html
	ii1 = 0;
	ii2 = 0;
	ii3 = 0;
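	pthread_spinlock_t lll;
	// the lock is only shared between threads of this process
	int pshared = PTHREAD_PROCESS_PRIVATE;
	int ret;

	/* initialize a spin lock */
	ret = pthread_spin_init(&lll, pshared);
	if (ret != 0)
	{
		std::cerr << "pthread_spin_init failed: " << ret << std::endl;
		return 1;
	}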
	beg = std::chrono::steady_clock::now();
	for (int i = 0; i < ts; i++)
	{
		pthread_spin_lock(&lll);
		ii1 += i / 2;
		ii2 += 1;
		ii3 += ii1;
		pthread_spin_unlock(&lll);
	}
	end = std::chrono::steady_clock::now();
	std::cout << "run pthread_spinlock_t " << ts << " time cost (ms) = "
		<< std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count() << std::endl;
#endif

	return 0;
}

Build settings: Visual Studio 2022 with default settings on Windows, and g++ --std=c++11 Test.cpp on Linux. Results:

Physical machine, CentOS 7, CPU i5-4460

atomic 267ms
mutex 162ms

VirtualBox VM, Debian 10, laptop CPU

run 10000000 time cost (ms) = 326
run 10000000 time cost (ms) = 587
run 10000000 time cost (ms) = 729
run 10000000 time cost (ms) = 834

Physical machine, Win10, CPU AMD 5700G

run int      10000000 time cost (ms) = 9
run atomic   10000000 time cost (ms) = 187
run mutex    10000000 time cost (ms) = 233
run SpinLock 10000000 time cost (ms) = 152

VirtualBox VM, Debian 10, CPU 5700G

run int      10000000 time cost (ms) = 9
run atomic   10000000 time cost (ms) = 92
run mutex    10000000 time cost (ms) = 200
run SpinLock 10000000 time cost (ms) = 194
run spinlock 10000000 time cost (ms) = 234
run pthread_spinlock_t 10000000 time cost (ms) = 119

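A few notes. On the CentOS 7 physical machine I had initially written only the atomic versus mutex comparison. I added the other comparisons later, but by then that machine was temporarily unavailable, so it has only two numbers. On Windows, spinlock uses a GCC builtin that Visual Studio does not provide, so that measurement is missing there (as is the pthread one, which sits behind the same #ifndef). I also kept editing the code by hand while testing on different machines and did not go back to rerun it on the earlier ones, so the output format differs a little, but the logic is the same.

With no contention at all, the numbers are quite interesting:

The atomic type is much slower than the native int type.
On Linux the results are fairly consistent: int > atomic > SpinLock > mutex (fastest first), and 2 atomics cost roughly as much as one mutex.
On Windows: int > SpinLock > atomic > mutex.
The mutex and atomic implementations on Linux are much faster than those on Windows.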

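Since my programs mostly run on Linux, I will not analyze the Windows results. On Linux, 2 atomics cost roughly as much as one mutex, which matches expectations: without contention a mutex boils down to two compare-and-set style atomic operations, one for lock and one for unlock, which is equivalent to operating on 2 atomics. SpinLock, which is built on an atomic, is almost identical to mutex: sometimes faster, sometimes slower, but never by much. I worried that my implementation was inefficient, so I tried another one I found online, which additionally issues the x86 PAUSE instruction through the GCC builtin __builtin_ia32_pause. It turned out to be even slower. Then I tested pthread_spinlock_t, which is very efficient. That may also be down to compiler flags: pthread is linked in as a library built with different flags, whereas this test program is too simple to be compiled with optimization (with -O2 the locks are simply optimized away). But pthread_spinlock_t is not part of the C++ standard, so I am unlikely to use it, and the remaining options differ little.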
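One way the loop might be kept measurable under -O2 is sketched below. It assumes the problem is the compiler discarding work whose result is never used, which I have not verified for this exact program. The names NullLock, LoopResult and timed_loop are invented for the sketch. The counters are volatile so every read and write is emitted, and unsigned so the additions wrap instead of overflowing. Any type with lock() and unlock(), such as std::mutex or the SpinLock above, can be passed in.

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical no-op lock, useful for timing the bare loop as a baseline.
struct NullLock
{
	void lock() noexcept {}
	void unlock() noexcept {}
};

// Hypothetical result type: elapsed time plus a checksum of the counters.
struct LoopResult
{
	long long elapsed_ms;
	std::uint32_t checksum;
};

// Runs the same three additions as the test program, taking and releasing
// the given lock on every iteration.
template <typename Lock>
LoopResult timed_loop(Lock& lock, std::uint32_t iterations)
{
	// volatile: the compiler must perform every access, even with -O2
	volatile std::uint32_t a = 0;
	volatile std::uint32_t b = 0;
	volatile std::uint32_t c = 0;

	const auto beg = std::chrono::steady_clock::now();
	for (std::uint32_t i = 0; i < iterations; i++)
	{
		std::lock_guard<Lock> guard(lock); // unlocks at the end of each iteration
		a = a + i / 2;
		b = b + 1;
		c = c + a;
	}
	const auto end = std::chrono::steady_clock::now();

	LoopResult result;
	result.elapsed_ms = static_cast<long long>(
		std::chrono::duration_cast<std::chrono::milliseconds>(end - beg).count());
	result.checksum = a + b + c;
	return result;
}
```

Calling timed_loop(m, 10000000) with a std::mutex m, and again with a NullLock, would give a locked and an unlocked timing from the same loop body. Note that volatile also slows the baseline, so the numbers are not comparable with the unoptimized ones in this post.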

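The tests above cover only the completely uncontended case. I did not test contention, because under contention atomic, mutex and SpinLock behave differently, and the question becomes which one fits the business logic rather than which one is faster.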

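atomic only guarantees consistent reads and writes of the variable itself. It cannot guarantee the consistency of a piece of logic, so a bare atomic variable is not a substitute for a lock.
mutex guarantees the consistency of logic (for a single variable, atomic is enough and no lock is needed). Under contention a mutex enters kernel mode and yields the CPU, so it suits critical sections that run for a relatively long time.
SpinLock also guarantees the consistency of logic, but it does not yield the CPU, so it suits critical sections that run for a short time.
Take the code below. It performs a push, so atomic is out of the question, and a push clearly takes very little time, so SpinLock is the better fit.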

std::vector<int> v;
SpinLock l;
l.lock();
v.push_back(1);
l.unlock();
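Because SpinLock exposes lock() and unlock(), it can also be used with std::lock_guard, which releases the lock even if push_back throws. The following is only a sketch: LockedVector is a made-up wrapper, and the SpinLock here is a cut-down stand-in for the one in the test program so that the block compiles on its own.

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

// Cut-down stand-in for the atomic_flag SpinLock from the test program.
class SpinLock
{
public:
	void lock() noexcept
	{
		while (_flag.test_and_set(std::memory_order_acquire)) {}
	}
	void unlock() noexcept
	{
		_flag.clear(std::memory_order_release);
	}
private:
	std::atomic_flag _flag = ATOMIC_FLAG_INIT;
};

// Hypothetical wrapper: every access to the vector goes through the spin lock.
class LockedVector
{
public:
	void push(int value)
	{
		std::lock_guard<SpinLock> guard(_lock); // released when guard goes out of scope
		_items.push_back(value);
	}

	std::size_t size()
	{
		std::lock_guard<SpinLock> guard(_lock);
		return _items.size();
	}

private:
	SpinLock _lock;
	std::vector<int> _items;
};
```

One caveat: push_back may reallocate, which is not always a short operation, so reserving capacity up front would keep the critical section closer to what a spin lock is meant for.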

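So when contention does occur, the choice has to be made from the actual business logic. Simulating it with a simple for loop is pointless: mutex will certainly be the slowest, but it can yield the CPU, which matters a great deal in real programs.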
