CUDA Learning: CDP Quicksort Implementation (A Single Partition Pass)

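CDP quicksort implementation (background knowledge): https://blog.csdn.net/shungry/article/details/90520554

With the main idea of quicksort understood, the next step is to study the official sample. To make it easier to follow, I extracted a single partition pass from the official cdpAdvancedQuicksort sample.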

__global__ void qsort_warp(unsigned *indata,
                           unsigned *outdata,
                           unsigned int offset,
                           unsigned int len,
                           qsortAtomicData *atomicData,
                           qsortRingbuf *atomicDataStack,
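                           unsigned int source_is_indata, // In the full sample: whether the source buffer is the original input (buffers swap per recursion level); unused in this single pass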
                           unsigned int depth)
{
    // Handle to thread block group
    cg::thread_block cta = cg::this_thread_block();
    // Find my data offset, based on warp ID
    unsigned int thread_id = threadIdx.x + (blockIdx.x << QSORT_BLOCKSIZE_SHIFT); 
    //unsigned int warp_id = threadIdx.x >> 5;   // Used for debug only
    unsigned int lane_id = threadIdx.x & (warpSize-1); // Lane index within the warp (not the warp ID)

    // Exit if I'm outside the range of sort to be done
    if (thread_id >= len)
        return;

    // Read in the data and the pivot. Arbitrary pivot selection for now.
    unsigned pivot = indata[offset + len/2];
    unsigned data  = indata[offset + thread_id];

    cg::coalesced_group active = cg::coalesced_threads();
    unsigned int greater = (data > pivot);
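	// Flag the values that are greater than the chosen pivot
    unsigned int gt_mask = active.ballot(greater);
	// ballot gathers the predicate of every thread in the group into one 32-bit mask
	// (for a full warp this corresponds to __ballot_sync(0xFFFFFFFF, predicate)): bit i is set when lane i passed a non-zero predicate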

    if (gt_mask == 0) // No value in this warp is > pivot, i.e. every value is <= pivot
    {
        greater = (data >= pivot);
        gt_mask = active.ballot(greater);    // Must re-ballot for adjusted comparator
    }

    unsigned int lt_mask = active.ballot(!greater);
    unsigned int gt_count = __popc(gt_mask); // __popc counts the set bits of a 32-bit integer; here, the number of lanes on the "greater" side
    unsigned int lt_count = __popc(lt_mask); // Here, the number of lanes on the "not greater" side

    // Atomically adjust the lt_ and gt_offsets by this amount. Only one thread need do this. Share the result using shfl
    unsigned int lt_offset, gt_offset;

    if (lane_id == 0)  // Lane 0 of the warp
    {   // atomicAdd returns the old value and then atomically adds in place (like i++)
        if (lt_count > 0) // Reserve this warp's slots on the "not greater" side
            lt_offset = atomicAdd((unsigned int *) &atomicData->lt_offset, lt_count);

        if (gt_count > 0)
            gt_offset = len - (atomicAdd((unsigned int *) &atomicData->gt_offset, gt_count) + gt_count);
    }

	// Warp shuffle: every thread receives the offsets from lane 0
    lt_offset = active.shfl((int)lt_offset, 0);   // Everyone pulls the offsets from lane 0
    gt_offset = active.shfl((int)gt_offset, 0);

    unsigned lane_mask_lt;
	// Get the mask of the lanes positioned below this thread within the warp:
	// every bit below this lane is set to 1, e.g. lane_id = 5 gives binary 11111
    asm("mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt));
    unsigned int my_mask = greater ? gt_mask : lt_mask;
    unsigned int my_offset = __popc(my_mask & lane_mask_lt); 
	// Counts how many lanes before this one fall on the same side of the pivot ("greater" or "not greater")

    // Move data.
    my_offset += greater ? gt_offset : lt_offset;
    outdata[offset + my_offset] = data;
}

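This code is extracted from the sample. It is designed so that each thread handles one element of the sequence, which exposes as much parallelism as possible and makes full use of the GPU.

1. The program first determines the thread ID, the lane ID within the warp, and the value held by this thread. pivot is the quicksort pivot: comparing a value against pivot decides whether it goes to the front part or the back part. The code that is easiest to get confused by is: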

unsigned int greater = (data > pivot);
unsigned int gt_mask = active.ballot(greater);
    ......
unsigned int lt_mask = active.ballot(!greater);
unsigned int gt_count = __popc(gt_mask); // __popc counts the set bits of a 32-bit integer; here, the number of lanes on the "greater" side
unsigned int lt_count = __popc(lt_mask); // Here, the number of lanes on the "not greater" side

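The ballot function collects the argument (greater) passed in by every thread of the warp and packs the results into a single 32-bit value: bit i is 1 when lane i passed a non-zero predicate. Here that means the bits of the threads whose value is > pivot are set. For a full warp, ballot is backed by __ballot_sync(0xFFFFFFFF, greater), where 0xFFFFFFFF is the mask of participating lanes. So gt_mask and lt_mask mark the lanes of the warp whose values are > pivot and <= pivot respectively.

__popc counts the number of 1 bits in a 32-bit integer, as noted in the comments. The counts are used later to work out where each value is stored.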
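To make the ballot and __popc steps above concrete, here is a small host-side C++ emulation (plain C++, not device code). The helper names emulate_ballot, greater_mask and count_bits are my own, chosen for illustration; they are not part of CUDA or of the sample. It is only a sketch, assuming a single warp of at most 32 lanes.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for a warp-wide ballot:
// bit i of the result is set when lane i's predicate is true.
uint32_t emulate_ballot(const std::vector<bool>& predicate) {
    assert(predicate.size() <= 32);  // one warp holds at most 32 lanes
    uint32_t mask = 0;
    for (std::size_t lane = 0; lane < predicate.size(); ++lane) {
        if (predicate[lane]) mask |= (1u << lane);
    }
    return mask;
}

// Builds the "greater than pivot" mask for one warp's worth of values.
uint32_t greater_mask(const std::vector<unsigned>& data, unsigned pivot) {
    std::vector<bool> greater(data.size());
    for (std::size_t lane = 0; lane < data.size(); ++lane) {
        greater[lane] = data[lane] > pivot;
    }
    return emulate_ballot(greater);
}

// Host-side equivalent of __popc: number of set bits in a 32-bit value.
unsigned count_bits(uint32_t mask) {
    return static_cast<unsigned>(std::bitset<32>(mask).count());
}
```

For the values {5, 1, 9, 3} and pivot 4, lanes 0 and 2 hold larger values, so the mask is binary 0101 and the count is 2.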

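2. Next, the first thread (lane 0) of each warp sets the warp's storage offsets (one for lt, one for gt) and advances the shared counters for the next warp. This guarantees that another warp never writes to the same memory addresses and keeps the two sides apart: the lt side fills from the front of the range, while the gt side fills backwards from the end (hence the len - ... in the code). Because several warps update the counters concurrently, atomic operations are required. Note in particular that atomicAdd returns the old value and then performs the atomic add in place (like i++). For details on atomicAdd, see: https://blog.csdn.net/shungry/article/details/90521592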
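The slot reservation can be sketched on the host with std::atomic, whose fetch_add also returns the old value, just like atomicAdd. PartitionCounters, WarpBase and reserve_slots are hypothetical names of my own (PartitionCounters only stands in for the two fields of qsortAtomicData used here); treat this as an illustration under those assumptions, not as the sample's code.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the two counters of qsortAtomicData used in this pass.
struct PartitionCounters {
    std::atomic<unsigned> lt_offset{0};
    std::atomic<unsigned> gt_offset{0};
};

// Start positions reserved for one warp on each side of the pivot.
struct WarpBase {
    unsigned lt_base;
    unsigned gt_base;
};

// Mirrors what lane 0 does: fetch_add returns the OLD counter value,
// so each warp receives a range no other warp can obtain.
WarpBase reserve_slots(PartitionCounters& counters, unsigned len,
                       unsigned lt_count, unsigned gt_count) {
    assert(lt_count + gt_count <= len);
    WarpBase base{0, 0};
    // "Not greater" values fill the output from the front.
    base.lt_base = counters.lt_offset.fetch_add(lt_count);
    // "Greater" values fill the output backwards from the end.
    base.gt_base = len - (counters.gt_offset.fetch_add(gt_count) + gt_count);
    return base;
}
```

With len = 10, a first warp reserving 3 lt and 1 gt slots gets bases 0 and 9; a second warp reserving 2 lt and 4 gt slots gets bases 3 and 5, so the ranges never overlap. Unlike the kernel, this sketch does not skip the add when a count is zero, which is harmless because adding zero leaves the counter unchanged.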

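3. Next, a warp shuffle broadcasts the offsets computed above from lane 0 of each warp to the other threads. There are four kinds of warp shuffle in total (__shfl_sync, __shfl_up_sync, __shfl_down_sync and __shfl_xor_sync); I will write them up when I have time.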

 lt_offset = active.shfl((int)lt_offset, 0);   // Everyone pulls the offsets from lane 0
 gt_offset = active.shfl((int)gt_offset, 0);

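4. Once every thread has its warp's offset, it adds its own rank within the warp to find its final position and stores its value there. The comments in this part of the code are fairly clear.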
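The position calculation in step 4 can be checked by hand with a host-side sketch. lanemask_below imitates what the %lanemask_lt register provides, and rank_in_side and final_index are illustrative names of my own, not functions from the sample.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Imitates %lanemask_lt: all bits below `lane` are set (lane 5 -> binary 11111).
uint32_t lanemask_below(unsigned lane) {
    assert(lane < 32);
    return (1u << lane) - 1u;
}

// Number of lanes before `lane` that are on the same side of the pivot,
// where `side_mask` is gt_mask or lt_mask depending on this lane's side.
unsigned rank_in_side(uint32_t side_mask, unsigned lane) {
    return static_cast<unsigned>(
        std::bitset<32>(side_mask & lanemask_below(lane)).count());
}

// Final index relative to `offset`: the warp's base on this lane's side
// plus the lane's rank within that side.
unsigned final_index(bool greater, uint32_t gt_mask, uint32_t lt_mask,
                     unsigned lane, unsigned lt_base, unsigned gt_base) {
    const uint32_t my_mask = greater ? gt_mask : lt_mask;
    return (greater ? gt_base : lt_base) + rank_in_side(my_mask, lane);
}
```

For example, with gt_mask = binary 0101 and lt_mask = binary 1010, lane 2 is the second "greater" lane, so with gt_base = 6 it writes to index 7.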

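That completes one partition pass. Each thread computes its own destination index and then stores its value there, so apart from a small number of atomicAdd calls almost all of the work runs in parallel, which is why it is fast.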
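To check the whole pass end to end without a GPU, the sketch below replays it on the host, one 32-lane "warp" at a time. It is a sequential emulation of my own (the function name partition_pass is hypothetical), so plain counters replace the atomics and a loop replaces the ballot; it assumes the same pivot choice and the same >= fallback as the kernel above.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential host emulation of one partition pass.
// Returns the number of elements placed on the "not greater" side.
std::size_t partition_pass(const std::vector<unsigned>& in,
                           std::vector<unsigned>& out) {
    const std::size_t len = in.size();
    out.assign(len, 0);
    if (len == 0) return 0;
    const unsigned pivot = in[len / 2];   // same arbitrary pivot as the kernel
    std::size_t lt_total = 0, gt_total = 0;  // play the role of the atomic counters

    for (std::size_t base = 0; base < len; base += 32) {
        const std::size_t lanes = std::min<std::size_t>(32, len - base);
        const uint32_t active = (lanes == 32) ? 0xFFFFFFFFu : ((1u << lanes) - 1u);

        // Ballot on "data > pivot".
        uint32_t gt_mask = 0;
        for (std::size_t lane = 0; lane < lanes; ++lane)
            if (in[base + lane] > pivot) gt_mask |= (1u << lane);
        // Fallback used by the kernel: if nothing is greater, re-ballot with >=.
        if (gt_mask == 0)
            for (std::size_t lane = 0; lane < lanes; ++lane)
                if (in[base + lane] >= pivot) gt_mask |= (1u << lane);
        const uint32_t lt_mask = ~gt_mask & active;

        const std::size_t gt_count = std::bitset<32>(gt_mask).count();
        const std::size_t lt_count = std::bitset<32>(lt_mask).count();

        // Reserve slots: lt side grows from the front, gt side from the back.
        const std::size_t lt_base = lt_total;
        const std::size_t gt_base = len - (gt_total + gt_count);
        lt_total += lt_count;
        gt_total += gt_count;

        // Each lane ranks itself among the earlier lanes on its own side.
        for (std::size_t lane = 0; lane < lanes; ++lane) {
            const bool greater = (gt_mask >> lane) & 1u;
            const uint32_t my_mask = greater ? gt_mask : lt_mask;
            const uint32_t below = (1u << lane) - 1u;
            const std::size_t rank = std::bitset<32>(my_mask & below).count();
            out[(greater ? gt_base : lt_base) + rank] = in[base + lane];
        }
    }
    return lt_total;
}
```

For the input {7, 2, 9, 4, 4, 1, 8, 3} the pivot is 4, five values are not greater than it, and the output is {2, 4, 4, 1, 3, 7, 9, 8}. Neither side is sorted yet; the full sample continues by recursing on each side.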
