Further Optimizations Exploiting the Microarchitecture of the Processor

In the prior essay, Transforming an Abstract Program into More Efficient Code Systematically, we applied optimizations that did not rely on any features of the target machine. They simply reduced the overhead of procedure calls and eliminated some of the critical "optimization blockers" that cause difficulties for optimizing compilers.

      As we seek to push the performance further, we must consider optimizations that exploit the microarchitecture of the processor, that is, the underlying system design by which a processor executes instructions[1].

      Firstly, loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration[1]. 

      Secondly, for a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the set of combining operations into two or more parts and combining the results at the end.

      Thirdly, the reassociation transformation changes the order in which the elements are combined. It breaks the sequential dependencies and thereby improves performance beyond the latency bound.

      Three versions of the combining code, combine5, combine6, and combine7, which use two-way loop unrolling, two-way parallelism, and the reassociation transformation respectively, are presented below:

// Unroll loop by 2                          // Unroll loop by 2, 2-way parallelism       // Change associativity of combining operation
void combine5(vec_ptr v, data_t *dest)       void combine6(vec_ptr v, data_t *dest)       void combine7(vec_ptr v, data_t *dest)
{                                            {                                            {
    long int length = vec_length(v);             long int length = vec_length(v);             long int length = vec_length(v);
    long int limit = length-1;                   long int limit = length-1;                   long int limit = length-1;
    data_t *data = get_vec_start(v);             data_t *data = get_vec_start(v);             data_t *data = get_vec_start(v);
    data_t acc = IDENT;                          data_t acc0 = IDENT;                         data_t acc = IDENT;
                                                 data_t acc1 = IDENT;
    // Combine 2 elements at a time              // Combine 2 elements at a time              // Combine 2 elements at a time
    long int i = 0;                              long int i = 0;                              long int i = 0;
    for (i=0; i<limit; i+=2) {                   for (i=0; i<limit; i+=2) {                   for (i=0; i<limit; i+=2) {
        acc = (acc OP data[i]) OP data[i+1];         acc0 = acc0 OP data[i];                      acc = acc OP (data[i] OP data[i+1]);
                                                     acc1 = acc1 OP data[i+1];
    }                                            }                                            }

    // Finish any remaining elements             // Finish any remaining elements             // Finish any remaining elements
    for (; i<length; ++i) {                      for (; i<length; ++i) {                      for (; i<length; ++i) {
        acc = acc OP data[i];                        acc0 = acc0 OP data[i];                      acc = acc OP data[i];
    }                                            }                                            }
    *dest = acc;                                 *dest = acc0 OP acc1;                        *dest = acc;
}                                            }                                            }
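
The listings above depend on vec_ptr, data_t, IDENT, OP, vec_length, and get_vec_start, none of which are defined in this essay. The following is a minimal sketch of definitions that would let them compile, assuming long integer elements and multiplication as the combining operation. The struct layout and the helpers new_vec, free_vec, and combine_simple are assumptions made for illustration, not necessarily the definitions used in the prior essay.

```c
#include <stdlib.h>

/* Assumed element type and combining operation (integer product). */
typedef long data_t;
#define IDENT 1   /* identity element for multiplication */
#define OP *      /* combining operation */

/* Assumed vector representation: a length plus a heap-allocated array. */
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Hypothetical helper: allocate a zero-filled vector of len elements. */
vec_ptr new_vec(long len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    v->data = NULL;
    if (len > 0) {
        v->data = calloc((size_t)len, sizeof(data_t));
        if (v->data == NULL) {
            free(v);
            return NULL;
        }
    }
    return v;
}

/* Hypothetical helper: release a vector created by new_vec. */
void free_vec(vec_ptr v)
{
    if (v != NULL) {
        free(v->data);
        free(v);
    }
}

long vec_length(vec_ptr v) { return v->len; }

data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Baseline without unrolling; the unrolled versions should match its result. */
void combine_simple(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    for (long i = 0; i < length; ++i)
        acc = acc OP data[i];
    *dest = acc;
}
```

With these definitions in place, each of combine5, combine6, and combine7 can be pasted in separately and checked against combine_simple on the same vector.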

     Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program results, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.
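
To illustrate the first point with a larger unrolling factor, here is a minimal sketch of four-way unrolling for the concrete case of summing long integers (OP is + and IDENT is 0). The function name sum_unroll4 is hypothetical. The loop limit generalizes from length-1 to length-3 so that data[i+3] never reads past the end of the array.

```c
/* Sum an array with 4-way loop unrolling (a sketch, OP = +, IDENT = 0). */
long sum_unroll4(const long *data, long length)
{
    long limit = length - 3;  /* main loop needs i+3 < length */
    long acc = 0;
    long i;

    /* Combine 4 elements per iteration: one index update and one
       branch now cover four additions instead of one. */
    for (i = 0; i < limit; i += 4) {
        acc = (((acc + data[i]) + data[i+1]) + data[i+2]) + data[i+3];
    }

    /* Finish any remaining elements (at most 3). */
    for (; i < length; ++i) {
        acc = acc + data[i];
    }
    return acc;
}
```

Note that the four additions still form a single dependency chain through acc, so this version reduces loop overhead but does not shorten the critical path.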

      In the loop of combine6, we have two critical paths, taking OP to be multiplication: one corresponding to computing the product of the even-numbered elements (acc0) and one for the odd-numbered elements (acc1). Each of these critical paths contains only n/2 operations, thus leading to a CPE (cycles per element) of L/2, where L is the latency of the combining operation.
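
A rough way to see where L/2 comes from is sketched below. It assumes that the critical paths dominate the running time and that the processor has enough functional units to run the chains in parallel. The symbols are introduced here for illustration: T(n) is the running time in cycles for n elements, L is the latency of the combining operation, k is the number of independent accumulators, and c is a constant overhead.

```latex
% k independent accumulator chains, each with n/k operations of latency L
T(n) \approx \frac{n}{k}\,L + c
\qquad\Longrightarrow\qquad
\mathrm{CPE} = \frac{dT}{dn} \approx \frac{L}{k},
\qquad
k = 2 \;\Rightarrow\; \mathrm{CPE} \approx \frac{L}{2}.
```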

      In combine7, each iteration still has two load and two mul operations, but only one of the mul operations forms a data-dependency chain between loop registers. We therefore have only n/2 operations along the critical path. As we increase the unrolling factor k, we continue to have only one operation per iteration along the critical path.
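
Reassociation is only guaranteed to preserve the result when the combining operation is truly associative, as integer addition and multiplication are. Floating-point addition is not associative in general, so the combine5 and combine7 groupings can give different answers on the same data. The sketch below shows both groupings for double-precision sums; the function names are hypothetical.

```c
/* combine5-style grouping: acc = (acc + d[i]) + d[i+1] */
double sum_left_to_right(const double *data, long length)
{
    long limit = length - 1;
    double acc = 0.0;
    long i;
    for (i = 0; i < limit; i += 2) {
        acc = (acc + data[i]) + data[i+1];
    }
    for (; i < length; ++i) {
        acc = acc + data[i];
    }
    return acc;
}

/* combine7-style grouping: acc = acc + (d[i] + d[i+1]) */
double sum_reassociated(const double *data, long length)
{
    long limit = length - 1;
    double acc = 0.0;
    long i;
    for (i = 0; i < limit; i += 2) {
        /* The pair is added first, independently of acc. */
        acc = acc + (data[i] + data[i+1]);
    }
    for (; i < length; ++i) {
        acc = acc + data[i];
    }
    return acc;
}
```

For the input {1.0, 1e20, -1e20, 1.0}, and assuming IEEE 754 double arithmetic without fast-math style compiler flags, sum_left_to_right returns 1.0 while sum_reassociated returns 0.0, because the final 1.0 is rounded away when it is first added to -1e20. Whether such a difference is acceptable depends on the application.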

References

[1] Randal E. Bryant, David R. O'Hallaron (2011). Computer Systems: A Programmer's Perspective (Second Edition). Beijing: China Machine Press.

