Further Optimizations Exploiting the Microarchitecture of the Processor

In the prior essay, Transforming an Abstract Program into More Efficient Code Systematically, we applied optimizations that did not rely on any features of the target machine. They simply reduced the overhead of procedure calls and eliminated some of the critical "optimization blockers" that cause difficulties for optimizing compilers.

As we seek to push the performance further, we must consider optimizations that exploit the microarchitecture of the processor, that is, the underlying system design by which a processor executes instructions[1].

Firstly, loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration[1].

Secondly, for a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the set of combining operations into two or more parts and combining the results at the end.

Thirdly, the reassociation transformation changes the order in which the elements are combined. It is a way to break the sequential dependencies and thereby improve performance beyond the latency bound.

Three versions of the combining code are presented below: combine5 uses two-way loop unrolling, combine6 adds two-way parallelism, and combine7 applies the reassociation transformation:

// Unroll loop by 2                          // Unroll loop by 2, 2-way parallelism       // Change associativity of combining operation
void combine5(vec_ptr v, data_t *dest)       void combine6(vec_ptr v, data_t *dest)       void combine7(vec_ptr v, data_t *dest)
{                                            {                                            {
    long int length = vec_length(v);             long int length = vec_length(v);             long int length = vec_length(v);
    long int limit = length-1;                   long int limit = length-1;                   long int limit = length-1;
    data_t *data = get_vec_start(v);             data_t *data = get_vec_start(v);             data_t *data = get_vec_start(v);
    data_t acc = IDENT;                          data_t acc0 = IDENT;                         data_t acc = IDENT;
                                                 data_t acc1 = IDENT;
    // Combine 2 elements at a time              // Combine 2 elements at a time              // Combine 2 elements at a time
    long int i = 0;                              long int i = 0;                              long int i = 0;
    for (i=0; i<limit; i+=2) {                   for (i=0; i<limit; i+=2) {                   for (i=0; i<limit; i+=2) {
        acc = (acc OP data[i]) OP data[i+1];         acc0 = acc0 OP data[i];                      acc = acc OP (data[i] OP data[i+1]);
                                                     acc1 = acc1 OP data[i+1];
    }                                            }                                            }

    // Finish any remaining elements             // Finish any remaining elements             // Finish any remaining elements
    for (; i<length; ++i) {                      for (; i<length; ++i) {                      for (; i<length; ++i) {
        acc = acc OP data[i];                        acc0 = acc0 OP data[i];                      acc = acc OP data[i];
    }                                            }                                            }
    *dest = acc;                                 *dest = acc0 OP acc1;                        *dest = acc;
}                                            }                                            }
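The listings above rely on vec_ptr, data_t, IDENT, OP, vec_length and get_vec_start, none of which are defined in this essay. The following is a minimal sketch of scaffolding under which they would compile, assuming a long element type and addition as the combining operation. The struct layout and the helpers new_vec, free_vec and combine_ref are illustrative assumptions, not the original definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed definitions: element type and combining operation. */
typedef long data_t;
#define IDENT 0
#define OP +

/* Assumed vector representation: a length plus a heap-allocated array. */
typedef struct {
    long int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Hypothetical constructor: allocates a zero-filled vector of len elements. */
vec_ptr new_vec(long int len)
{
    assert(len >= 0);
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    v->data = NULL;
    if (len > 0) {
        v->data = calloc((size_t)len, sizeof(data_t));
        if (v->data == NULL) {
            free(v);
            return NULL;
        }
    }
    return v;
}

/* Hypothetical destructor. */
void free_vec(vec_ptr v)
{
    if (v != NULL) {
        free(v->data);
        free(v);
    }
}

long int vec_length(vec_ptr v) { return v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Reference version, one element per iteration, for checking results. */
void combine_ref(vec_ptr v, data_t *dest)
{
    long int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    for (long int i = 0; i < length; ++i)
        acc = acc OP data[i];
    *dest = acc;
}
```

With these definitions in scope, combine5, combine6 and combine7 can be compiled as written and their results compared against combine_ref on the same vector.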

Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program results, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.

In the loop of combine6, we have two critical paths, one corresponding to computing the product of the even-numbered elements and one for the odd-numbered elements (taking OP to be multiplication). Each of these critical paths contains only n/2 operations, thus leading to a CPE of L/2, where L is the latency of the combining operation.
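The same idea extends to more accumulators. The following is a rough sketch, assuming integer addition on a plain array, of four-way unrolling with four independent accumulators, so that each dependency chain covers about n/4 elements. The function name sum_unroll4_par4 is hypothetical and does not appear in the original essay.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums n longs with 4-way unrolling and
   four independent accumulators (four separate dependency chains). */
long sum_unroll4_par4(const long *data, long n)
{
    assert(n >= 0 && (data != NULL || n == 0));
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long limit = n - 3;
    long i = 0;

    /* Combine 4 elements at a time, one per accumulator. */
    for (; i < limit; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }

    /* Finish any remaining elements. */
    for (; i < n; ++i)
        acc0 += data[i];

    /* Merge the partial results at the end. */
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Whether four accumulators actually run faster than two depends on the latency and throughput of the functional units of the target processor, so this should be measured rather than assumed.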

In combine7, each iteration still has two load and two mul operations, but only one of the mul operations forms a data-dependency chain between loop registers. We therefore have only n/2 operations along the critical path. As we increase the unrolling factor k, we continue to have only one operation per iteration along the critical path.
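Reassociation is safe for integer addition and multiplication, but floating-point arithmetic is not associative, so the combine7 ordering can give a different result from the combine5 ordering. The sketch below illustrates this with double addition on a plain array. The function names sum_in_order and sum_reassociated are hypothetical, and the input values are chosen only to make the rounding difference visible.

```c
#include <assert.h>
#include <stddef.h>

/* combine5-style ordering: (acc + data[i]) + data[i+1]. */
double sum_in_order(const double *data, long n)
{
    assert(n >= 0 && (data != NULL || n == 0));
    double acc = 0.0;
    long i = 0;
    for (; i < n - 1; i += 2)
        acc = (acc + data[i]) + data[i + 1];
    for (; i < n; ++i)
        acc = acc + data[i];
    return acc;
}

/* combine7-style ordering: acc + (data[i] + data[i+1]). */
double sum_reassociated(const double *data, long n)
{
    assert(n >= 0 && (data != NULL || n == 0));
    double acc = 0.0;
    long i = 0;
    for (; i < n - 1; i += 2)
        acc = acc + (data[i] + data[i + 1]);
    for (; i < n; ++i)
        acc = acc + data[i];
    return acc;
}
```

For the input {1.0, 1e20, -1e20, 1.0}, IEEE 754 double arithmetic without value-changing compiler options gives 1.0 for the first ordering and 0.0 for the second, while the exact sum is 2.0. For well-conditioned data the two orderings usually agree closely, which is why the transformation is often acceptable in practice.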

References

[1] Randal E. Bryant and David R. O'Hallaron (2011). Computer Systems: A Programmer's Perspective (Second Edition). Beijing: China Machine Press.
