CUDA学习-cdp快排实现（一次快排）

cdp快排实现（预备知识）：https://blog.csdn.net/shungry/article/details/90520554

理解了快排的主要原理就，接下来通过官方例子进行进一步的理解。为了更好的进行理解，把官方例子cdpAdvancedQuicksort拆解出一次快排的例子。

__global__ void qsort_warp(unsigned *indata,
                           unsigned *outdata,
                           unsigned int offset,
                           unsigned int len,
                           qsortAtomicData *atomicData,
                           qsortRingbuf *atomicDataStack,
                           unsigned int source_is_indata, //输出的值要不要也存在输入里
                           unsigned int depth)
{
    // Handle to thread block group
    cg::thread_block cta = cg::this_thread_block();
    // Find my data offset, based on warp ID
    unsigned int thread_id = threadIdx.x + (blockIdx.x << QSORT_BLOCKSIZE_SHIFT); 
    //unsigned int warp_id = threadIdx.x >> 5;   // Used for debug only
    unsigned int lane_id = threadIdx.x & (warpSize-1);//warp id

    // Exit if I'm outside the range of sort to be done
    if (thread_id >= len)
        return;

    // Read in the data and the pivot. Arbitrary pivot selection for now.
    unsigned pivot = indata[offset + len/2];
    unsigned data  = indata[offset + thread_id];

    cg::coalesced_group active = cg::coalesced_threads();
    unsigned int greater = (data > pivot);
	//比较比选取值（pivot）要大的数
    unsigned int gt_mask = active.ballot(greater);
	//ballot 调用 __ballot_sync 作用于线程束中的每个线程
	//__ballot_sync(0xFFFFFFFF, predicate)每个线程所在位 与mask(0xFFFFFFFF)做与运算 留下为1的

    if (gt_mask == 0) //说明所有的值都比pivot “<=”
    {
        greater = (data >= pivot);
        gt_mask = active.ballot(greater);    // Must re-ballot for adjusted comparator
    }

    unsigned int lt_mask = active.ballot(!greater);
    unsigned int gt_count = __popc(gt_mask);//计算64位整数中设置为1的位数。这里计算线程束中大于piovt的数量
    unsigned int lt_count = __popc(lt_mask);//计算64位整数中设置为1的位数。这里计算线程束中小于piovt的数量

    // Atomically adjust the lt_ and gt_offsets by this amount. Only one thread need do this. Share the result using shfl
    unsigned int lt_offset, gt_offset;

    if (lane_id == 0)  //线程束内偏移为0的线程
    {   //atomicAdd就是返回原有值，然后再在原地址上进行原子+ (类似i++)
        if (lt_count > 0)//设置偏移值
            lt_offset = atomicAdd((unsigned int *) &atomicData->lt_offset, lt_count);

        if (gt_count > 0)
            gt_offset = len - (atomicAdd((unsigned int *) &atomicData->gt_offset, gt_count) + gt_count);
    }

	//束洗牌 每个都通过线程0的接受偏移值
    lt_offset = active.shfl((int)lt_offset, 0);   // Everyone pulls the offsets from lane 0
    gt_offset = active.shfl((int)gt_offset, 0);

    unsigned lane_mask_lt;
	//获得线程在warp内的位置的掩码
	//此位置前的二进制都置1 如lane_id=5，所得的二进制：11111
    asm("mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt));
    unsigned int my_mask = greater ? gt_mask : lt_mask;
    unsigned int my_offset = __popc(my_mask & lane_mask_lt); 
	//计算64位整数中设置为1的位数。 这里计算的是在本线程前和它一样'>'或‘<=’pivot的个数

    // Move data.
    my_offset += greater ? gt_offset : lt_offset;
    outdata[offset + my_offset] = data;
}

这个代码是拆解出来的，程序设计成一个线程处理序列中的一个元素，这样的话充分的实现其并行性，充分利用GPU资源。

1.程序先确定线程id，线程warp内id以及在本id上的数值，其中pivot就是快排中的基准值，通过与pivot进行比较确定放在前面部分还是后面部分。比较容易看浑的代码就是 :

unsigned int greater = (data > pivot);
unsigned int gt_mask = active.ballot(greater);
    ......
int lt_mask = active.ballot(!greater);
unsigned int gt_count = __popc(gt_mask);//计算64位整数中设置为1的位数。这里计算线程束中大于piovt的数量
unsigned int lt_count = __popc(lt_mask);//计算64位整数中设置为1的位数。这里计算线程束中小于piovt的数量

ballot函数是对warp中所有线程传入的参数（greater）与0xFFFFFFFF进行位与操作，最后得出一个int类型的值。在这里就是表示>pivot的线程所在的位为1。ballot背后调用的是__ballot_sync(oxFFFFFFFF,greater)。所以gt_mask\lt_mask分别表示>pivot和<=pivot的warp内线程。

__popc是统计整数中1的个数，具体注释上有写，对于后面设置存放位置有帮助。

2.后面的代码就是warp中第一个线程对warp存放位置的offset进行设置（有lt、gt），并对下一个warp存放位置进行修改（保证另一个warp进行操作时不会在同一个内存地址，以及保证相对的有序），所以要使用原子操作。特别要注意的是atomicAdd函数是返回原有值，然后再在原地址上进行原子ADD (类似i++)。具体可以看atomicAdd函数：https://blog.csdn.net/shungry/article/details/90521592

3.后面通过束洗牌，通过每个warp的id0线程把上面算的偏移进行广播。这里的束洗牌总共有4种，有时间我再补上。

 lt_offset = active.shfl((int)lt_offset, 0);   // Everyone pulls the offsets from lane 0
 gt_offset = active.shfl((int)gt_offset, 0);

4.每个得到warp的偏移量之后，再根据自己所在的位置进行最后定位自己的位置并存放下来，后面的部分注释都写得比较清楚了。

这样一次排序就好了，由于中间通过计算来确定每个线程的存放下标，然后进行存储。除了少数的atomicAdd，大多数都是并行计算的，所以速度很快。

CUDA学习-cdp快排实现（一次快排）

开源高性能结构化日志模块NanoLog

【简写Mybatis-02】注册机的实现以及SqlSession处理

手绘二维码

.NET借助虚拟网卡实现一个简单异地组网工具

並行scan算法 Hillis Steels 和 Blelloch

2D凸包算法（六）：Incremental Method

2D凸包算法（五）：Divide and Conquer

【thrust庫】thrust::scatter 和 thrust::gather

2D凸包算法（一）：Jarvis' March （ Gift Wrapping Algorithm ）

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結