Lock-Free SPSC Queue

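In multithreaded programming, a well-known problem is the Producer-Consumer Problem (PC Problem).

For this kind of problem, a RingBuffer protected by semaphores and locks (https://www.cnblogs.com/sinkinben/p/14087750.html) is easy to implement, but its performance is lacking.

Consider a special case in which there is exactly one producer and one consumer (Single Producer Single Consumer, SPSC). In this case, we can design a lock-free queue to solve the PC problem.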

0. Background

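Consider the following scenario: inside a compute-intensive and latency-sensitive for loop, every iteration has to print the current iteration count and the result of the computation.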

void matrix_compute()
{
    for (i = 0 to n)
    {
        // code of computing
        ...
        // print i and result of computing
        std::cout << ...
    }
}

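In this case, plain std::cout output adds significant latency to every iteration, because I/O is slow compared with the computation.

An intuitive solution is to pack the log into a string and hand it to another thread, which prints it. This gives asynchronous logging.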

1. Lock-free SPSC Queue

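Here the queue is implemented as a RingBuffer.

Because the queue is SPSC, the head index is written only by the Consumer and the tail index is written only by the Producer. The SPSC Queue can therefore be lock-free, provided that these writes are atomic and are ordered correctly with respect to the buffer accesses.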

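#include <atomic>
#include <cstddef>
#include <vector>

template <class T> class spsc_queue
{
  private:
    std::vector<T> m_buffer;
    std::atomic<size_t> m_head; // next slot to dequeue, written only by the Consumer
    std::atomic<size_t> m_tail; // next free slot, written only by the Producer
  public:
    // One extra slot is allocated so that a full queue can be told apart from an empty one.
    explicit spsc_queue(size_t capacity) : m_buffer(capacity + 1), m_head(0), m_tail(0) {}
    inline bool enqueue(const T &item);
    inline bool dequeue(T &item);
};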

For a RingBuffer with N slots (here N = m_buffer.size() = capacity + 1), the empty and full checks are:

Empty: head == tail
Full: (tail + 1) % N == head
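
To make the two conditions concrete, here is a minimal single-threaded sketch of the index arithmetic alone. The `ring_index` struct and its member names are hypothetical, introduced only for illustration; no elements are stored. It assumes N slots, of which one always stays unused so that full and empty remain distinguishable.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original post): tracks only the two
// indices of a ring buffer with N slots, without storing any elements.
struct ring_index
{
    std::size_t N;        // number of slots = capacity + 1
    std::size_t head = 0; // next slot to read
    std::size_t tail = 0; // next slot to write

    bool empty() const { return head == tail; }
    bool full() const { return (tail + 1) % N == head; }
    std::size_t size() const { return (tail + N - head) % N; }

    bool push()
    {
        if (full())
            return false;
        tail = (tail + 1) % N;
        return true;
    }
    bool pop()
    {
        if (empty())
            return false;
        head = (head + 1) % N;
        return true;
    }
};

// Walks a ring with 4 slots (capacity 3) from empty to full and back.
inline void ring_index_demo()
{
    ring_index r{4};
    assert(r.empty() && !r.full());
    assert(r.push() && r.push() && r.push()); // three elements fit
    assert(r.full() && r.size() == 3);
    assert(!r.push());                        // the fourth is rejected
    assert(r.pop() && r.pop() && r.pop());
    assert(r.empty() && !r.pop());
}
```

With 4 slots the ring reports full after 3 pushes, which is why the constructor of spsc_queue allocates capacity + 1 slots.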

Accordingly, enqueue and dequeue can be implemented as follows:

inline bool enqueue(const T &item)
{
    const size_t tail = m_tail.load(std::memory_order_relaxed);
    const size_t next = (tail + 1) % m_buffer.size();

    if (next == m_head.load(std::memory_order_acquire))
        return false;

    m_buffer[tail] = item;
    m_tail.store(next, std::memory_order_release);
    return true;
}

inline bool dequeue(T &item)
{
    const size_t head = m_head.load(std::memory_order_relaxed);

    if (head == m_tail.load(std::memory_order_acquire))
        return false;

    item = m_buffer[head];
    const size_t next = (head + 1) % m_buffer.size();
    m_head.store(next, std::memory_order_release);
    return true;
}
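
As a usage sketch, the class and the two member functions above can be assembled into one translation unit and driven by one producer thread and one consumer thread. The driver `spsc_demo` is hypothetical, and the spin loops call `std::this_thread::yield()` (an assumption made here so the demo stays friendly on machines with few cores; a latency-critical program might spin without yielding).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

template <class T> class spsc_queue
{
  private:
    std::vector<T> m_buffer;
    std::atomic<std::size_t> m_head;
    std::atomic<std::size_t> m_tail;

  public:
    explicit spsc_queue(std::size_t capacity) : m_buffer(capacity + 1), m_head(0), m_tail(0) {}

    bool enqueue(const T &item)
    {
        const std::size_t tail = m_tail.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % m_buffer.size();
        if (next == m_head.load(std::memory_order_acquire))
            return false; // full
        m_buffer[tail] = item;
        m_tail.store(next, std::memory_order_release); // publish the element
        return true;
    }

    bool dequeue(T &item)
    {
        const std::size_t head = m_head.load(std::memory_order_relaxed);
        if (head == m_tail.load(std::memory_order_acquire))
            return false; // empty
        item = m_buffer[head];
        m_head.store((head + 1) % m_buffer.size(), std::memory_order_release); // free the slot
        return true;
    }
};

// Hypothetical driver: the producer pushes 0..count-1, the consumer checks
// FIFO order and returns the sum of everything it received.
inline std::uint64_t spsc_demo(std::uint64_t count)
{
    spsc_queue<std::uint64_t> q(1024);
    std::uint64_t sum = 0;

    std::thread producer([&] {
        for (std::uint64_t i = 0; i < count; ++i)
            while (!q.enqueue(i))
                std::this_thread::yield(); // queue is full, retry
    });
    std::thread consumer([&] {
        std::uint64_t value = 0;
        for (std::uint64_t i = 0; i < count; ++i)
        {
            while (!q.dequeue(value))
                std::this_thread::yield(); // queue is empty, retry
            assert(value == i);
            sum += value;
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

The release store on one index pairs with the acquire load of the same index on the other thread, which is what makes the element written to m_buffer visible before the consumer reads it.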

For how std::memory_order is used, see: https://en.cppreference.com/w/cpp/atomic/memory_order

Benchmark: throughput of the SPSC Queue:

Mean:   29,158,897.200000 elements/s 
Median: 29,178,822.000000 elements/s 
Max:    29,315,199 elements/s 
Min:    28,995,515 elements/s

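The benchmark works as follows:

The Producer and the Consumer perform 1e8 enqueue and 1e8 dequeue operations respectively. We measure the total time t until the queue is drained; the throughput is 1e8 / t.
This is repeated 10 times, and the mean, median, min, and max are reported.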
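
The measurement described above could be sketched as follows. This is not the benchmark code from the repository: `measure_throughput`, `summary`, and `summarize` are hypothetical names, and the two callables are assumed to behave like spsc_queue::enqueue and spsc_queue::dequeue (return false when the queue is full or empty).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical harness: runs n enqueues and n dequeues on two threads and
// returns elements per second.
template <class Enqueue, class Dequeue>
double measure_throughput(std::size_t n, Enqueue try_enqueue, Dequeue try_dequeue)
{
    const auto start = std::chrono::steady_clock::now();

    std::thread producer([&] {
        for (std::size_t i = 0; i < n; ++i)
            while (!try_enqueue(i))
                std::this_thread::yield();
    });
    std::thread consumer([&] {
        std::size_t value = 0;
        for (std::size_t i = 0; i < n; ++i)
            while (!try_dequeue(value))
                std::this_thread::yield();
    });
    producer.join();
    consumer.join(); // the queue is drained once the consumer has finished

    const std::chrono::duration<double> t = std::chrono::steady_clock::now() - start;
    return static_cast<double>(n) / t.count();
}

struct summary
{
    double mean, median, max, min;
};

// Reduces the per-run throughputs to the four reported statistics.
inline summary summarize(std::vector<double> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t k = samples.size();
    const double mean = std::accumulate(samples.begin(), samples.end(), 0.0) / static_cast<double>(k);
    const double median = (k % 2 == 1) ? samples[k / 2] : (samples[k / 2 - 1] + samples[k / 2]) / 2.0;
    return {mean, median, samples.back(), samples.front()};
}
```

A run in the spirit of the text would call measure_throughput 10 times with n = 1e8, passing lambdas that forward to the queue's enqueue and dequeue, and feed the 10 results to summarize.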

2. Remove cache false sharing

What is Cache False Sharing? See the Exercise section of Architecture of Modern CPU.

int *a = new int[1024]; 
void worker(int idx)
{
    for (int j = 0; j < 1e9; j++)
        a[idx] = a[idx] + 1;
}

Consider the following programs:

P1: start 2 threads that run worker(0) and worker(1)
P2: start 2 threads that run worker(0) and worker(16)

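P2 runs faster than P1. On modern CPUs a cache line is typically 64 bytes. a[0] and a[1] sit in the same cache line, and the two threads run on different cores, so each core keeps its own copy of that line. Every write by one core invalidates the copy held by the other core and forces the line to move between the caches (see the MESI protocol), even though the two threads never touch the same element. This is false sharing; the threads write to different variables, so it is a performance problem and not a data race. a[0] and a[16] are 64 bytes apart (16 ints of 4 bytes each), so they fall into different cache lines and the problem disappears.

So the SPSC Queue above can be improved as follows: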
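
The line arithmetic behind P1 and P2 can be checked with a small sketch. It assumes 64-byte cache lines and a 4-byte int, as in the text; `cache_line_of`, `aligned_ints`, and `kCacheLine` are hypothetical names. The array is aligned to a line boundary here so that the layout is deterministic, which `new int[1024]` does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: 64-byte cache lines, as stated in the text.
constexpr std::size_t kCacheLine = 64;
static_assert(sizeof(int) == 4, "the sketch assumes a 4-byte int");

// Index of the cache line that holds the given address.
inline std::uintptr_t cache_line_of(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// With the array aligned to a line boundary, a[0..15] occupy the first
// line, a[16..31] the second, and so on.
struct aligned_ints
{
    alignas(kCacheLine) int a[1024];
};

inline void cache_line_demo()
{
    static aligned_ints s{};
    assert(cache_line_of(&s.a[0]) == cache_line_of(&s.a[1]));   // P1: same line
    assert(cache_line_of(&s.a[0]) != cache_line_of(&s.a[16]));  // P2: different lines
    assert(cache_line_of(&s.a[15]) != cache_line_of(&s.a[16])); // line boundary
}
```

Two threads that write to elements with the same line index contend for that line; elements with different line indices can be updated independently.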

template <class T>
class spsc_queue
{
private:
    std::vector<T> m_buffer;
    alignas(64) std::atomic<size_t> m_head;
    alignas(64) std::atomic<size_t> m_tail;
};

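It is more appropriate to replace the literal 64 in alignas(64) with std::hardware_destructive_interference_size (C++17, declared in <new>), the constant intended for keeping objects far enough apart to avoid false sharing. Its counterpart, std::hardware_constructive_interference_size, is meant for the opposite goal of keeping objects together in one cache line. The cache-line size depends on the concrete CPU and is not always 64 bytes.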

#include <cstddef>
#include <new> // declares the interference-size constants and the feature-test macro (C++17)

#ifdef __cpp_lib_hardware_interference_size
using std::hardware_constructive_interference_size;
using std::hardware_destructive_interference_size;
#else
// Fallback: 64 bytes on x86-64 (L1_CACHE_BYTES, L1_CACHE_SHIFT, __cacheline_aligned, ...)
constexpr std::size_t hardware_constructive_interference_size = 64;
constexpr std::size_t hardware_destructive_interference_size = 64;
#endif
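
The effect of alignas on the layout can be verified at compile time. The following sketch uses a hypothetical struct `padded_indices` that holds only the two indices, and a constant `kDestructive` that takes the standard value when the library provides it and falls back to an assumed 64 bytes otherwise.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructive = 64; // fallback assumption
#endif

// Hypothetical struct: only the two indices of the queue, each on its own line.
struct padded_indices
{
    alignas(kDestructive) std::atomic<std::size_t> head{0};
    alignas(kDestructive) std::atomic<std::size_t> tail{0};
};

// The struct starts on a line boundary and spans at least two lines.
static_assert(alignof(padded_indices) >= kDestructive, "struct is line-aligned");
static_assert(sizeof(padded_indices) >= 2 * kDestructive, "each index gets its own line");

// Distance in bytes between the two indices inside one object.
inline std::size_t distance_between_indices()
{
    padded_indices p;
    const auto h = reinterpret_cast<std::uintptr_t>(&p.head);
    const auto t = reinterpret_cast<std::uintptr_t>(&p.tail);
    return static_cast<std::size_t>(t - h);
}
```

The price of the padding is memory: the two 8-byte indices now occupy at least two full cache lines.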

Benchmark results:

Mean:   38,993,940.400000 elements/s 
Median: 39,027,123.000000 elements/s 
Max:    39,253,946 elements/s 
Min:    38,624,197 elements/s

3. Remove useless memory access

Code that uses spsc_queue typically looks like this:

spsc_queue<int> sq(1024);
// Producer keeps spinning until the element is accepted
int x = 233;
while (!sq.enqueue(x)) {}

Meanwhile, enqueue and dequeue contain the full and empty checks:

inline bool enqueue(const T &item)
{
    const size_t tail = m_tail.load(std::memory_order_relaxed);
    const size_t next = (tail + 1) % m_buffer.size();
    if (next == m_head.load(std::memory_order_acquire))
        return false;
    // ...
}

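Every call to m_head.load makes the Producer's core read the cache line that holds m_head. The Consumer keeps writing that line, so it has to be fetched from the Consumer's core again and again. Yet the queue-full condition is rarely true (in realistic workloads both Producer and Consumer are compute-intensive; otherwise a lock-free data structure would not be needed in the first place). For the full and empty checks, each side can therefore keep its own copy of the other side's index, which stays in a cache "closer to the CPU", and reload the shared atomic only when that copy says the queue looks full or empty.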

template <class T>
class spsc_queue
{
private:
    std::vector<T> m_buffer;
    alignas(hardware_destructive_interference_size) std::atomic<size_t> m_head;
    alignas(hardware_destructive_interference_size) std::atomic<size_t> m_tail;

    // Plain (non-atomic) copies; each one is read and written by a single thread only.
    alignas(hardware_destructive_interference_size) size_t cached_head = 0; // used by the Producer
    alignas(hardware_destructive_interference_size) size_t cached_tail = 0; // used by the Consumer
};

inline bool enqueue(const T &item)
{
    const size_t tail = m_tail.load(std::memory_order_relaxed);
    const size_t next = (tail + 1) % m_buffer.size();

    if (next == cached_head)
    {
        cached_head = m_head.load(std::memory_order_acquire);
        if (next == cached_head)
            return false;
    }
    // ...
}
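
The post shows only the enqueue side. A complete sketch, under the assumption that dequeue mirrors it with cached_tail, could look like this. The class is named `cached_spsc_queue` here to keep it apart from the earlier versions, `kLine` is an assumed 64-byte cache-line size, and `cached_spsc_demo` is a hypothetical driver.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kLine = 64; // assumed cache-line size for this sketch

template <class T> class cached_spsc_queue
{
  private:
    std::vector<T> m_buffer;
    alignas(kLine) std::atomic<std::size_t> m_head{0};
    alignas(kLine) std::atomic<std::size_t> m_tail{0};
    alignas(kLine) std::size_t cached_head = 0; // touched only by the producer
    alignas(kLine) std::size_t cached_tail = 0; // touched only by the consumer

  public:
    explicit cached_spsc_queue(std::size_t capacity) : m_buffer(capacity + 1) {}

    bool enqueue(const T &item)
    {
        const std::size_t tail = m_tail.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % m_buffer.size();
        if (next == cached_head) // looks full: refresh the local copy
        {
            cached_head = m_head.load(std::memory_order_acquire);
            if (next == cached_head)
                return false; // really full
        }
        m_buffer[tail] = item;
        m_tail.store(next, std::memory_order_release);
        return true;
    }

    bool dequeue(T &item)
    {
        const std::size_t head = m_head.load(std::memory_order_relaxed);
        if (head == cached_tail) // looks empty: refresh the local copy
        {
            cached_tail = m_tail.load(std::memory_order_acquire);
            if (head == cached_tail)
                return false; // really empty
        }
        item = m_buffer[head];
        m_head.store((head + 1) % m_buffer.size(), std::memory_order_release);
        return true;
    }
};

// Hypothetical driver: returns how many elements arrived in FIFO order.
inline std::size_t cached_spsc_demo(std::size_t count)
{
    cached_spsc_queue<std::size_t> q(8); // small capacity exercises the refresh path
    std::size_t received = 0;
    std::thread producer([&] {
        for (std::size_t i = 0; i < count; ++i)
            while (!q.enqueue(i))
                std::this_thread::yield();
    });
    std::thread consumer([&] {
        std::size_t value = 0;
        for (std::size_t i = 0; i < count; ++i)
        {
            while (!q.dequeue(value))
                std::this_thread::yield();
            assert(value == i);
            ++received;
        }
    });
    producer.join();
    consumer.join();
    return received;
}
```

A stale cached_head can only make the queue look fuller than it is, and a stale cached_tail can only make it look emptier, so a stale copy costs at most one extra atomic load and never lets a slot be overwritten or read too early.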

Benchmark results:

Mean:   79,740,671.300000 elements/s 
Median: 79,838,314.000000 elements/s 
Max:    80,044,793 elements/s 
Min:    79,241,180 elements/s

4. Summary

Github: https://github.com/sinkinben/lock-free-queue

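Throughput comparison of the 3 versions of spsc_queue (mean, median, max, min). After removing cache false sharing and reading head and tail from a local cached copy first, the mean throughput rises from about 29.2M to about 79.7M elements/s, an improvement of more than 2x (roughly 2.7x).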
