Fastest way to determine if an integer is between two integers (inclusive) with known sets of values

Is there a faster way than x >= start && x <= end in C or C++ to test if an integer is between two integers?

UPDATE: My specific platform is iOS. This is part of a box blur function that restricts pixels to a circle in a given square.

UPDATE: After trying the accepted answer, I got an order of magnitude speedup on the one line of code over doing it the normal x >= start && x <= end way.

UPDATE: Here is the after and before code, with the assembler output from Xcode:

NEW WAY

// diff = (end - start) + 1
#define POINT_IN_RANGE_AND_INCREMENT(p, range) ((p++ - range.start) < range.diff)

Ltmp1313:
 ldr    r0, [sp, #176] @ 4-byte Reload
 ldr    r1, [sp, #164] @ 4-byte Reload
 ldr    r0, [r0]
 ldr    r1, [r1]
 sub.w  r0, r9, r0
 cmp    r0, r1
 blo    LBB44_30

OLD WAY

#define POINT_IN_RANGE_AND_INCREMENT(p, range) (p <= range.end && p++ >= range.start)

Ltmp1301:
 ldr    r1, [sp, #172] @ 4-byte Reload
 ldr    r1, [r1]
 cmp    r0, r1
 bls    LBB44_32
 mov    r6, r0
 b      LBB44_33
LBB44_32:
 ldr    r1, [sp, #188] @ 4-byte Reload
 adds   r6, r0, #1
Ltmp1302:
 ldr    r1, [r1]
 cmp    r0, r1
 bhs    LBB44_36

Pretty amazing how reducing or eliminating branching can provide such a dramatic speedup.


Reply #1

Reference: https://stackoom.com/question/19jH2/确定整数是否在具有已知值集的两个整数-包括-之间的最快方法


Reply #2

It depends on how many times you want to perform the test over the same data.

If you are performing the test a single time, there probably isn't a meaningful way to speed up the algorithm.

If you are doing this for a very finite set of values, then you could create a lookup table. Performing the indexing might be more expensive, but if you can fit the entire table in cache, then you can remove all branching from the code, which should speed things up.

For your data the lookup table would have 128^3 = 2,097,152 entries. If you can control one of the three variables, so that you consider all instances where start = N at one time, then the size of the working set drops to 128^2 = 16,384 bytes, which should fit well in most modern caches.

You would still have to benchmark the actual code to see if a branchless lookup table is sufficiently faster than the obvious comparisons.
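As a rough illustration of the fixed-start table described above, here is a minimal C++ sketch. It assumes every value lies in [0, 128) and uses one byte per entry; the names kDomain, RangeTable, build_table, and in_range_lookup are hypothetical and do not come from the original answer.

```cpp
#include <array>
#include <cstdint>

// Assumption: x, start, and end all lie in [0, 128).
constexpr int kDomain = 128;

// One table per fixed start: row = end, column = x.
// 128 * 128 one-byte entries = 16,384 bytes.
using RangeTable = std::array<std::array<std::uint8_t, kDomain>, kDomain>;

// Precompute the answer to "start <= x && x <= end" for every (end, x) pair.
RangeTable build_table(int start) {
    RangeTable table{};
    for (int end = 0; end < kDomain; ++end) {
        for (int x = 0; x < kDomain; ++x) {
            table[end][x] = (x >= start && x <= end) ? 1 : 0;
        }
    }
    return table;
}

// The range test becomes a single indexed load with no comparison of x.
inline bool in_range_lookup(const RangeTable& table, int x, int end) {
    return table[end][x] != 0;
}
```

Building the table costs 16,384 comparisons, so it only pays off if the same start is reused for many tests, and only a benchmark can say whether the load beats two compares.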


Reply #3

There's an old trick to do this with only one comparison/branch. Whether it'll really improve speed may be open to question, and even if it does, it's probably too little to notice or care about, but when you're only starting with two comparisons, the chances of a huge improvement are pretty remote. The code looks like:

// use a < for an inclusive lower bound and exclusive upper bound
// use <= for an inclusive lower bound and inclusive upper bound
// alternatively, if the upper bound is inclusive and you can pre-calculate
//  upper-lower, simply add + 1 to upper-lower and use the < operator.
    if ((unsigned)(number-lower) <= (upper-lower))
        in_range(number);

With a typical, modern computer (i.e., anything using two's complement), the conversion to unsigned is really a nop, just a change in how the same bits are viewed.

Note that in a typical case, you can pre-compute upper-lower outside a (presumed) loop, so that doesn't normally contribute any significant time. Along with reducing the number of branch instructions, this also (generally) improves branch prediction. In this case, the same branch is taken whether the number is below the bottom end or above the top end of the range.
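To make the point about hoisting upper-lower concrete, here is a small C++ sketch of such a loop. The function name count_in_range and the use of std::vector are assumptions for illustration, and lower <= upper is assumed. The casts are applied before the subtraction so the arithmetic wraps as unsigned instead of overflowing a signed int.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts the elements v with lower <= v <= upper.
// Assumes lower <= upper.
std::size_t count_in_range(const std::vector<int>& values, int lower, int upper) {
    const unsigned base = static_cast<unsigned>(lower);
    // upper - lower is computed once, outside the loop.
    const unsigned width = static_cast<unsigned>(upper) - base;

    std::size_t count = 0;
    for (int v : values) {
        // One comparison per element: values below lower wrap around to a
        // large unsigned number and fail the same test as values above upper.
        if (static_cast<unsigned>(v) - base <= width) {
            ++count;
        }
    }
    return count;
}
```

Whether this beats the two-comparison form depends on the compiler and target, so it is worth measuring both.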

As to how this works, the basic idea is pretty simple: a negative number, when viewed as an unsigned number, will be larger than anything that started out as a positive number.

In practice this method translates number and the interval so that lower sits at the origin, then checks whether number is in the interval [0, D], where D = upper - lower. If number is below the lower bound, the shifted value is negative (and therefore huge when viewed as unsigned); if it is above the upper bound, the shifted value is larger than D.
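The shift to the origin can be worked through with concrete numbers. The sketch below uses fixed-width 32-bit types so the wrapped value is predictable; shift_to_origin and in_closed_range are hypothetical names, and lower <= upper is assumed.

```cpp
#include <cstdint>

// Move `number` so that `lower` maps to 0, using unsigned (wrapping) arithmetic.
std::uint32_t shift_to_origin(std::int32_t number, std::int32_t lower) {
    return static_cast<std::uint32_t>(number) - static_cast<std::uint32_t>(lower);
}

// Inclusive test for lower <= number <= upper, assuming lower <= upper.
bool in_closed_range(std::int32_t number, std::int32_t lower, std::int32_t upper) {
    const std::uint32_t d = shift_to_origin(upper, lower);  // D = upper - lower
    return shift_to_origin(number, lower) <= d;
}
```

With lower = 5 and upper = 10, D is 5. For number = 3 the shifted value is 3 - 5, which wraps to 4294967294 and fails the test; for number = 11 the shifted value is 6, which also exceeds D; for number = 5 it is 0, which passes.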


Reply #4

It's rare to be able to do significant optimizations to code on such a small scale. Big performance gains come from observing and modifying the code from a higher level. You may be able to eliminate the need for the range test altogether, or only do O(n) of them instead of O(n^2). You may be able to re-order the tests so that one side of the inequality is always implied. Even if the algorithm is ideal, gains are more likely to come when you see how this code does the range test 10 million times and you find a way to batch them up and use SSE to do many tests in parallel.


Reply #5

Is it not possible to just perform a bitwise operation on the integer?

Since it has to be between 0 and 128, if the 8th bit is set (2^7) it is 128 or more. The edge case will be a pain, though, since you want an inclusive comparison.
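One way to read this suggestion is as a mask test, sketched below in C++. It checks every bit above the low seven, not just the 8th, so that values of 256 and above and negative values are rejected too. The function names are hypothetical, and the second function shows the inclusive edge case the answer mentions, which needs an extra comparison.

```cpp
// True when 0 <= x <= 127: no bit above the low seven may be set.
// The cast makes negative values show up as large unsigned numbers.
bool in_0_to_127(int x) {
    return (static_cast<unsigned>(x) & ~0x7Fu) == 0;
}

// Inclusive upper bound of 128: the mask alone cannot express it,
// so the edge value is checked separately.
bool in_0_to_128(int x) {
    return in_0_to_127(x) || x == 128;
}
```

This only works for ranges that start at 0 and end just below a power of two, so it is narrower than a general start/end test.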


Reply #6

This answer reports on a test of the accepted answer. I performed a closed-range test on a large vector of sorted random integers, and to my surprise the basic method of (low <= num && num <= high) is in fact faster than the accepted answer above! The test was done on an HP Pavilion g6 (AMD A6-3400APU with 6GB RAM). Here's the core code used for testing:

int num = rand();  // value tested against each pair of consecutive elements
// randVec and MaxNum are defined elsewhere (not shown in the post).
auto start = chrono::system_clock::now();

int inBetween1{ 0 };
for (int i = 1; i < MaxNum; ++i)
{
    if (randVec[i - 1] <= num && num <= randVec[i])
        ++inBetween1;
}
auto end = chrono::system_clock::now();
chrono::duration<double> elapsed_s1 = end - start;

compared with the following, which is the accepted answer above:

int inBetween2{ 0 };
for (int i = 1; i < MaxNum; ++i)
{
    if (static_cast<unsigned>(num - randVec[i - 1]) <= (randVec[i] - randVec[i - 1]))
        ++inBetween2;
}

Note that randVec is a sorted vector. For any size of MaxNum, the first method beats the second one on my machine!
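Before timing the two loops, it can help to confirm that they count the same elements. The following C++ sketch is a hypothetical harness, not the poster's code: count_both_ways, the seed parameter, and the value range 0 to 1,000,000 are all assumptions. It deliberately leaves out timing, since the results depend on compiler, flags, and hardware.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Returns {count from two comparisons, count from the unsigned trick}
// over consecutive pairs of a sorted random vector of size n.
std::pair<int, int> count_both_ways(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);

    std::vector<int> randVec(n);
    for (int& v : randVec) v = dist(gen);
    std::sort(randVec.begin(), randVec.end());
    const int num = dist(gen);

    int inBetween1 = 0;
    int inBetween2 = 0;
    for (std::size_t i = 1; i < randVec.size(); ++i) {
        const int lo = randVec[i - 1];
        const int hi = randVec[i];
        if (lo <= num && num <= hi) ++inBetween1;
        // Values stay within [0, 1000000], so num - lo cannot overflow.
        if (static_cast<unsigned>(num - lo) <= static_cast<unsigned>(hi - lo)) ++inBetween2;
    }
    return {inBetween1, inBetween2};
}
```

If the two counts ever differ, any timing comparison between the loops is meaningless, so this check is a cheap first step.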
