[C++] A problem that started with raising QPS

Business background:

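Face recognition includes one compute-heavy operation: face comparison. A face feature is normally a vector of float values. The obvious way to compare is to iterate, running a for loop over the whole gallery. That performs acceptably while the gallery is small, but once it grows to tens of thousands of faces or more, a plain for loop can no longer keep up. The optimization available here is to treat the face gallery as one large matrix, so that a query feature can be compared against every entry with a single matrix operation. OpenBLAS is the usual choice for CPU acceleration and cuBLAS for GPU acceleration. Even among matrix routines there is a clear speed difference between, for example, sgemv (matrix × vector) and sgemm (matrix × matrix), where a matrix is regarded here as a stack of vectors. This leads to a requirement: pack the query feature vectors into one large matrix and compare them all at once. That is the background.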
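To make the two shapes concrete, here is a minimal sketch in plain C++. It assumes the similarity score is a dot product (for example, cosine similarity on L2-normalized features), which the text does not state. The names `Matrix`, `score_one`, and `score_batch` are hypothetical, and the naive loops only stand in for BLAS calls such as `cblas_sgemv` and `cblas_sgemm`; any speed-up comes from the BLAS implementation, not from this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major float matrix: rows x cols.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// One query against every gallery row: the matrix x vector (sgemv) shape.
std::vector<float> score_one(const Matrix& gallery, const std::vector<float>& query) {
    std::vector<float> scores(gallery.rows, 0.0f);
    for (std::size_t i = 0; i < gallery.rows; ++i)
        for (std::size_t k = 0; k < gallery.cols; ++k)
            scores[i] += gallery.at(i, k) * query[k];
    return scores;
}

// B queries stacked as a B x D matrix against an N x D gallery:
// the matrix x matrix (sgemm) shape, result is B x N.
Matrix score_batch(const Matrix& gallery, const Matrix& queries) {
    Matrix scores(queries.rows, gallery.rows);
    for (std::size_t b = 0; b < queries.rows; ++b)
        for (std::size_t i = 0; i < gallery.rows; ++i) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < gallery.cols; ++k)
                acc += queries.at(b, k) * gallery.at(i, k);
            scores.at(b, i) = acc;
        }
    return scores;
}
```

Row `b` of the batched result holds the same scores that `score_one` would return for query `b`, so batching changes how the work is submitted, not what is computed.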

Thoughts prompted by the requirement

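What is fairly clear is that when concurrency is low, the hardware is powerful enough that there is no real need to pack queries into a large matrix. Only when concurrency is very high does the core algorithm of the comparison module need to be redesigned. Then an idea suddenly popped into my head...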

(Use a thread pool...)

Thread pool: first look

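To use a thread pool, I first had to know what one is. Having looked at it carefully, it turns out to be a solution practically made for the task at hand. A complete thread pool has three parts: a consumer layer, a queue layer, and a producer layer. This is the producer-consumer model: the producer layer adds data to the queue, and the consumer layer processes the data in the queue.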

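From the diagram above we can see:

1: A typical thread pool starts a fixed number of threads when it is initialized, which ensures the thread count does not grow without bound;
2: Attentive readers may notice that the "terminate?" decision in the consumer layer is missing an arrow. The usual advice is to stop the threads in the destructor; alternatively, an atomic flag (for example std::atomic<bool>) can be provided to control when the threads stop;
3: The diagram contains one key structure, the synchronized queue, which handles communication and data synchronization between the two sides;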
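The synchronized queue from point 3 can be sketched on its own with a mutex and a condition variable. This is a minimal illustration, not the structure from any particular library: `SyncQueue`, `close`, and `produce_and_sum` are hypothetical names, and the `closed_` flag plays the role of the stop signal from point 2.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical minimal synchronized queue: producers push, consumers block in pop().
template <class T>
class SyncQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        not_empty_.notify_one();
    }

    // Blocks until an item is available or close() was called.
    // Returns false only when the queue is closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

    // Stop signal: wakes every waiting consumer so it can exit.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        not_empty_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Demo helper: one producer pushes 1..n, one consumer sums what it receives.
long produce_and_sum(int n) {
    SyncQueue<int> queue;
    long sum = 0;
    std::thread consumer([&] {
        int value = 0;
        while (queue.pop(value)) sum += value;
    });
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i) queue.push(i);
        queue.close();
    });
    producer.join();
    consumer.join();
    return sum;  // safe to read: the consumer has been joined
}
```

Items that are still queued when `close()` is called are delivered before `pop` starts returning false, so no work is dropped on shutdown.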

With that, a simple thread pool can already be implemented:

Code source (will be removed on request): https://github.com/progschj/ThreadPool

#ifndef THREAD_POOL_H
#define THREAD_POOL_H
#include <vector>
#include <queue>
#include <memory>
#include <thread>
#include <mutex>
#include <condition_variable>
#include <future>
#include <functional>
#include <stdexcept>

class ThreadPool {
public:
    ThreadPool(size_t);     //initialization thread num
    template<class F, class... Args>
    auto enqueue(F&& f, Args&&... args) 
        -> std::future<typename std::result_of<F(Args...)>::type>;
    ~ThreadPool();
private:
    // need to keep track of threads so we can join them
    std::vector< std::thread > workers;
    // the task queue
    std::queue< std::function<void()> > tasks;
    
    // synchronization
    std::mutex queue_mutex;
    std::condition_variable condition;
    bool stop;
};
 
// the constructor just launches some amount of workers
inline ThreadPool::ThreadPool(size_t threads)
    :   stop(false)
{
    for(size_t i = 0;i<threads;++i)
        workers.emplace_back(
            [this]
            {
                for(;;)
                {
                    std::function<void()> task;

                    {
                        std::unique_lock<std::mutex> lock(this->queue_mutex);
                        this->condition.wait(lock,
                            [this]{ return this->stop || !this->tasks.empty(); });
                        if(this->stop && this->tasks.empty())
                            return;
                        task = std::move(this->tasks.front());
                        this->tasks.pop();
                    }

                    task();
                }
            }
        );
}

// add new work item to the pool
template<class F, class... Args>
auto ThreadPool::enqueue(F&& f, Args&&... args) 
    -> std::future<typename std::result_of<F(Args...)>::type>
{
    using return_type = typename std::result_of<F(Args...)>::type;

    auto task = std::make_shared< std::packaged_task<return_type()> >(
            std::bind(std::forward<F>(f), std::forward<Args>(args)...)
        );
        
    std::future<return_type> res = task->get_future();
    {
        std::unique_lock<std::mutex> lock(queue_mutex);

        // don't allow enqueueing after stopping the pool
        if(stop)
            throw std::runtime_error("enqueue on stopped ThreadPool");

        tasks.emplace([task](){ (*task)(); });
    }
    condition.notify_one();
    return res;
}

// the destructor joins all threads
inline ThreadPool::~ThreadPool()
{
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        stop = true;
    }
    condition.notify_all();
    for(std::thread &worker: workers)
        worker.join();
}
#endif

That covers the basics of thread pools. The end, cue the confetti...

Thread pool: second look

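Of course, having written this far, I could hardly stop here; what would set me apart from everyone else? After reading the code above carefully, a headache sets in: I understand the principle, yet it still feels a bit complicated. If you feel the same, please read on. The first diagram pointed out that the key piece is the synchronized queue, and operating on that queue requires locking. Locks always make me nervous. Is there a thread-safe queue, so that I only have to think about adding tasks and processing them? There is, and here is the ready-made wheel: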

concurrentqueue: a lock-free queue that supports multiple producers and multiple consumers;

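This lock-free queue guarantees that storing and retrieving data is thread-safe. Its main advantages are:
- Nothing else is needed; just include a single header file;
- No artificial limits on element type or count;
- Exception safe, with good memory management
- Supports super-fast bulk operations, which is exactly what I need

For the original requirement, one point matters: I do not want to read tasks from the queue one at a time; what I need is super-fast bulk operations. If tasks are read and written one by one, the switching of lock state on every task may itself eventually become the bottleneck.
The code below highlights try_dequeue_bulk, which takes up to a given number of tasks out of the queue in one call.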

#include <cassert>
#include <cstddef>
#include <thread>
#include "concurrentqueue.h"   // from https://github.com/cameron314/concurrentqueue

using moodycamel::ConcurrentQueue;

int main() {
	ConcurrentQueue<int> q;
	int dequeued[100] = { 0 };
	std::thread threads[20];

	// Producers: each of the 10 threads enqueues 10 distinct values in one bulk call
	for (int i = 0; i != 10; ++i) {
		threads[i] = std::thread([&](int i) {
			int items[10];
			for (int j = 0; j != 10; ++j) {
				items[j] = i * 10 + j;
			}
			q.enqueue_bulk(items, 10);
		}, i);
	}

	// Consumers: each thread makes a single attempt to take up to 20 items at once
	for (int i = 10; i != 20; ++i) {
		threads[i] = std::thread([&]() {
			int items[20];
			for (std::size_t count = q.try_dequeue_bulk(items, 20); count != 0; --count) {
				++dequeued[items[count - 1]];
			}
		});
	}

	// Wait for all threads
	for (int i = 0; i != 20; ++i) {
		threads[i].join();
	}

	// Collect any leftovers (could be some if e.g. consumers finish before producers)
	int items[10];
	std::size_t count;
	while ((count = q.try_dequeue_bulk(items, 10)) != 0) {
		for (std::size_t i = 0; i != count; ++i) {
			++dequeued[items[i]];
		}
	}

	// Make sure everything went in and came back out!
	for (int i = 0; i != 100; ++i) {
		assert(dequeued[i] == 1);
	}
	return 0;
}

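The sample above shows how the thread pool interface from the previous section can be simplified: once the queue's thread safety no longer needs attention, only the actual business logic remains to be handled. That more or less wraps things up.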
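One way to picture that simplification is a worker that drains the queue in batches and handles each batch with a single call. The sketch below is an illustration under stated assumptions: `BatchQueue`, `pop_bulk`, and `run_batches` are hypothetical names, and the mutex-protected `pop_bulk` is only a standard-library stand-in for concurrentqueue's `try_dequeue_bulk` (which is non-blocking and lock-free), so that the block compiles without third-party headers. It assumes `max_batch` is greater than zero.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a queue with bulk dequeue.
template <class T>
class BatchQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(value));
        }
        ready_.notify_one();
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

    // Waits for at least one item, then moves up to max_items into out.
    // Returns the number of items taken; 0 means closed and drained.
    std::size_t pop_bulk(std::vector<T>& out, std::size_t max_items) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        const std::size_t n = std::min(max_items, items_.size());
        out.clear();
        for (std::size_t i = 0; i < n; ++i) {
            out.push_back(std::move(items_.front()));
            items_.pop_front();
        }
        return n;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};

// Demo: enqueue `total` request ids, let one worker drain them in batches,
// and return the size of every batch it processed.
std::vector<std::size_t> run_batches(int total, std::size_t max_batch) {
    BatchQueue<int> queue;
    std::vector<std::size_t> batch_sizes;
    std::thread worker([&] {
        std::vector<int> batch;
        while (queue.pop_bulk(batch, max_batch) != 0) {
            // A real worker would stack the batch into one matrix and
            // issue a single sgemm-style comparison here.
            batch_sizes.push_back(batch.size());
        }
    });
    for (int i = 0; i < total; ++i) queue.push(i);
    queue.close();
    worker.join();
    return batch_sizes;
}
```

The batch sizes depend on thread timing: under light load the worker tends to see small batches, and under heavy load they approach `max_batch`, which matches the earlier observation that batching only pays off at high concurrency.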

Summary:

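A C++ thread pool makes concurrent programs easier to write to some extent. A simple pool can be built from just a mutex and a condition variable, which avoids creating threads over and over. Using a thread pool requires choosing a reasonable thread count and queue size. In my own use, setting the queue size too small led to deadlocks, and the cause of this kind of deadlock is sometimes hard to track down; after enlarging the queue the deadlocks went away. It is a trade-off between queue size and QPS.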
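The text does not identify what caused those deadlocks, so the following is only a sketch of one common defensive pattern, not a diagnosis: a bounded queue whose push gives up after a timeout instead of blocking forever, which lets the caller reject or retry a request when the queue stays full. `BoundedQueue`, `try_push_for`, and `try_pop` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical bounded queue with a timed push.
template <class T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Waits up to `timeout` for free space.
    // Returns false if the queue is still full when the timeout expires.
    bool try_push_for(T value, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        const bool has_space = not_full_.wait_for(
            lock, timeout, [this] { return items_.size() < capacity_; });
        if (!has_space) return false;
        items_.push(std::move(value));
        return true;
    }

    // Non-blocking pop; returns false when the queue is empty.
    bool try_pop(T& out) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (items_.empty()) return false;
            out = std::move(items_.front());
            items_.pop();
        }
        not_full_.notify_one();  // a slot was freed for a waiting producer
        return true;
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::queue<T> items_;
};
```

With a timed push, a queue that is too small shows up as rejected requests that can be counted and logged, which is usually easier to investigate than a stalled process.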

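References and links:

1: 深入應用C++11 (In-Depth Application of C++11: engineering-level code optimization)
2: C++ Concurrency in Action
3: https://github.com/cameron314/concurrentqueue
4: Introduction to lock-free:
5: Thread pools in detail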
