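This article analyzes how the five-stage pipeline and pipeline interlocks work, so that you can write more efficient assembly code.
1. The ARM9 five-stage pipeline
ARM7 uses a classic three-stage pipeline: fetch, decode, and execute. The execute stage does most of the work, including the register and memory accesses related to the operands, the ALU operation, and the data transfers between these units. Each stage normally takes one clock cycle, and because three instructions occupy the three stages at the same time, throughput can still reach one instruction per cycle. The execute stage, however, often needs several cycles, which makes it the performance bottleneck of the system.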
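The throughput claim above can be checked with a little arithmetic. The following is a minimal sketch in C, assuming an idealized pipeline in which every stage takes exactly one cycle and nothing ever stalls; the function name `ideal_pipeline_cycles` is made up for illustration.

```c
/* Cycles needed to run n instructions on an ideal pipeline.
   Assumption: every stage takes one cycle and there are no stalls.
   The first instruction needs `stages` cycles to pass through the
   pipeline; each later instruction finishes one cycle after the
   previous one. */
unsigned ideal_pipeline_cycles(unsigned stages, unsigned n)
{
    if (stages == 0 || n == 0)
        return 0;
    return stages + (n - 1);
}
```

For a long instruction stream the fixed fill cost becomes negligible, so both a three-stage and a five-stage pipeline approach one instruction per cycle under these assumptions.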
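ARM9 uses a more efficient five-stage pipeline that adds the LS1 and LS2 stages after fetch, decode, and execute. LS1 loads or stores the data specified by a load or store instruction. LS2 extracts and zero- or sign-extends the data loaded by a byte or halfword load. LS1 and LS2 only do work for load and store instructions; for every other instruction these two stages have no effect. The ARM documentation defines the stages as follows: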
- Fetch: Fetch from memory the instruction at address pc. The instruction is loaded into the core and then processes down the core pipeline.
- Decode: Decode the instruction that was fetched in the previous cycle. The processor also reads the input operands from the register bank if they are not available via one of the forwarding paths.
- ALU: Executes the instruction that was decoded in the previous cycle. Note this instruction was originally fetched from address pc − 8 (ARM state) or pc − 4 (Thumb state). Normally this involves calculating the answer for a data processing operation, or the address for a load, store, or branch operation. Some instructions may spend several cycles in this stage. For example, multiply and register-controlled shift operations take several ALU cycles.
- LS1: Load or store the data specified by a load or store instruction. If the instruction is not a load or store, then this stage has no effect.
- LS2: Extract and zero- or sign-extend the data loaded by a byte or halfword load instruction. If the instruction is not a load of an 8-bit byte or 16-bit halfword item, then this stage has no effect.
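To make the LS2 description concrete, here is a rough sketch in C of what "extract and zero- or sign-extend" means for a byte or halfword taken out of a 32-bit word. The function names, the lane parameters, and the little-endian lane numbering are assumptions chosen for illustration; this models the arithmetic only, not the actual hardware.

```c
#include <stdint.h>

/* Hypothetical model of LS2 for a byte load.
   word        : 32-bit value fetched from memory
   byte_lane   : byte to extract, 0 = least significant
                 (assumes little-endian lane numbering)
   sign_extend : nonzero for a signed byte load, zero for an unsigned one */
uint32_t ls2_extract_byte(uint32_t word, unsigned byte_lane, int sign_extend)
{
    uint32_t b = (word >> (8u * (byte_lane & 3u))) & 0xFFu;
    if (sign_extend && (b & 0x80u))
        b |= 0xFFFFFF00u;          /* replicate bit 7 into bits 8..31 */
    return b;
}

/* Same idea for a 16-bit halfword; half_lane 0 = lower half. */
uint32_t ls2_extract_halfword(uint32_t word, unsigned half_lane, int sign_extend)
{
    uint32_t h = (word >> (16u * (half_lane & 1u))) & 0xFFFFu;
    if (sign_extend && (h & 0x8000u))
        h |= 0xFFFF0000u;          /* replicate bit 15 into bits 16..31 */
    return h;
}
```

A full-word load needs neither step, which is consistent with the statement that LS2 has no effect unless the instruction loads an 8-bit or 16-bit item.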
2. Pipeline interlocks
LDR r1, [r2, #4]
ADD r0, r0, r1
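This code takes three clock cycles. The LDR computes r2+4 in its ALU stage while the ADD is still in the decode stage. One cycle is not enough to fetch the data at [r2, #4] from memory and write it back to r1, so in the next cycle, when the ADD would enter its ALU stage and needs r1, the value is not ready. The pipeline stalls the ADD until the LDR has completed its LS1 stage, and only then lets the ADD proceed to the ALU stage.
[Figure: pipeline interlock caused by the load-use dependency in the example above]
LDRB r1, [r2, #1]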
ADD r0, r0, r2
EOR r0, r0, r1
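This code takes four clock cycles. Because LDRB is a byte load, r1 is only written back after the LS2 stage has completed. The ADD does not use r1 and runs without delay, but the EOR has to wait one cycle.
[Figure: pipeline interlock caused by the byte load]
Now consider the following example: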
MOV r1, #1
B case1
AND r0, r0, r1
EOR r2, r2, r3
...
case1:
SUB r0, r0, r1
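The three instructions that actually execute here take five clock cycles, and the B instruction alone accounts for three of them: a taken branch flushes the instructions that follow it in the pipeline and refetches from the new address.
[Figure: pipeline flush caused by the branch]
3. Avoiding pipeline interlocks to improve performance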
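Avoiding interlocks starts with being able to predict them. The three examples in section 2 suggest simple rules: an ordinary instruction takes one cycle, a word load makes its result available one cycle late, a byte load two cycles late, and a taken branch costs three cycles. The sketch below encodes those rules in C. The `insn` structure, the `count_cycles` function, and the latency constants are illustrative assumptions distilled from the examples in this article, not a model of the real ARM9 hardware.

```c
#include <stddef.h>

enum kind { ALU_OP, LOAD_WORD, LOAD_BYTE, BRANCH_TAKEN };

struct insn {
    enum kind kind;
    int dest;        /* register written (0..15), or -1 for none */
    int src1, src2;  /* registers read (0..15), or -1 for none   */
};

/* Count cycles for a straight-line sequence of EXECUTED instructions.
   Assumed rules: 1 cycle per instruction, 3 cycles for a taken branch,
   1 extra cycle of result latency for a word load, 2 for a byte load. */
unsigned count_cycles(const struct insn *code, size_t n)
{
    unsigned cycle = 0;        /* cycles consumed so far                   */
    unsigned ready[16] = {0};  /* first cycle in which each register may be
                                  read by an instruction's ALU stage       */

    for (size_t i = 0; i < n; i++) {
        const struct insn *in = &code[i];

        /* Interlock: wait until both source registers are available. */
        if (in->src1 >= 0 && ready[in->src1] > cycle)
            cycle = ready[in->src1];
        if (in->src2 >= 0 && ready[in->src2] > cycle)
            cycle = ready[in->src2];

        unsigned issue = cycle;
        cycle += (in->kind == BRANCH_TAKEN) ? 3u : 1u;

        if (in->dest >= 0) {
            unsigned latency = 0;
            if (in->kind == LOAD_WORD) latency = 1;
            if (in->kind == LOAD_BYTE) latency = 2;
            ready[in->dest] = issue + 1 + latency;
        }
    }
    return cycle;
}
```

Fed with the sequences from section 2, this model gives 3 cycles for LDR/ADD, 4 cycles for LDRB/ADD/EOR, and 5 cycles for MOV/B/SUB, matching the counts stated there. Instructions skipped by the branch are simply left out of the input array.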
void str_tolower(char *out, char *in)
{
    unsigned int c;
    do {
        c = *(in++);
        if (c >= 'A' && c <= 'Z')
        {
            c = c + ('a' - 'A');
        }
        *(out++) = (char)c;
    } while (c);
}
The compiler generates the following assembly:
str_tolower
LDRB r2,[r1],#1 ; c = *(in++)
SUB r3,r2,#0x41 ; r3 = c - 'A'
CMP r3,#0x19 ; if (r3 <= 'Z' - 'A')
ADDLS r2,r2,#0x20 ; c += 'a' - 'A'
STRB r2,[r0],#1 ; *(out++) = (char)c
CMP r2,#0 ; if (c!=0)
BNE str_tolower ; goto str_tolower
MOV pc,r14 ; return
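Here the condition (c >= 'A' && c <= 'Z') has been compiled into the equivalent range test 0 <= c - 'A' <= 'Z' - 'A', which needs only a single unsigned comparison (the LS condition on ADDLS). Note that the SUB uses c in the instruction immediately after the LDRB that loads it. Since a byte load only delivers its result after LS2, the pipeline stalls for two cycles on every iteration.
3.1 Load Scheduling by Preloading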
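The idea of preloading is to load the first character before the loop starts and to load each following character at the end of the previous iteration, so that no instruction needs the loaded value right after the load. The structure can be sketched in C as below. The function name `str_tolower_preload_c` is made up for illustration, and a C compiler is free to reorder these statements, so this shows the shape of the loop rather than guaranteeing any particular schedule.

```c
/* Sketch of the preloading structure in C (hypothetical helper). */
void str_tolower_preload_c(char *out, const char *in)
{
    /* Preload the first character before entering the loop. */
    unsigned int c = (unsigned char)*in++;

    for (;;) {
        /* One unsigned comparison covers 'A' <= c <= 'Z':
           values below 'A' wrap around to large unsigned numbers. */
        if (c - 'A' <= (unsigned int)('Z' - 'A'))
            c += 'a' - 'A';

        *out++ = (char)c;
        if (c == 0)
            break;

        /* Load the next character now; it is first used at the top of
           the next iteration, after the loop branch. */
        c = (unsigned char)*in++;
    }
}
```

Calling it with the input "Hello, ARM9!" writes "hello, arm9!" to the output buffer, including the terminating zero byte.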
out RN 0 ; pointer to output string
in RN 1 ; pointer to input string
c RN 2 ; character loaded
t RN 3 ; scratch register
; void str_tolower_preload(char *out, char *in)
str_tolower_preload
LDRB c, [in], #1 ; c = *(in++)
loop
SUB t, c, #'A' ; t = c-'A'
CMP t, #'Z'-'A' ; if (t <= 'Z'-'A')
ADDLS c, c, #'a'-'A' ; c += 'a'-'A';
STRB c, [out], #1 ; *(out++) = (char)c;
TEQ c, #0 ; test if c==0
LDRNEB c, [in], #1 ; if (c!=0) { c=*in++;
BNE loop ; goto loop; }
MOV pc, lr ; return
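This version is one instruction longer than the code produced by the C compiler, yet it saves two clock cycles per iteration, because the next character is loaded at the end of the loop and the load latency is hidden behind the branch. The loop drops from 11 cycles per character to 9, which makes it 1.22 times as fast as the compiled C version.
3.2 Load Scheduling by Unrolling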
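Unrolling takes a different route: several characters are loaded back to back at the top of the loop, so by the time the first one is used its load has long completed. A rough C rendering of a loop unrolled three times is shown below. The function name `str_tolower_unrolled_c` is hypothetical. An important assumption is labeled in the code: because all three loads happen before any terminator check, the function may read up to two bytes past the end of the string, so the input buffer must be padded accordingly.

```c
/* Sketch of a three-way unrolled loop in C (hypothetical helper).
   ASSUMPTION: `in` has at least two readable bytes after its
   terminating zero, because three bytes are loaded before any check. */
void str_tolower_unrolled_c(char *out, const char *in)
{
    unsigned int c0, c1, c2;

    do {
        /* Issue all three loads first. */
        c0 = (unsigned char)in[0];
        c1 = (unsigned char)in[1];
        c2 = (unsigned char)in[2];
        in += 3;

        /* Convert each character with the unsigned range test. */
        if (c0 - 'A' <= (unsigned int)('Z' - 'A')) c0 += 'a' - 'A';
        if (c1 - 'A' <= (unsigned int)('Z' - 'A')) c1 += 'a' - 'A';
        if (c2 - 'A' <= (unsigned int)('Z' - 'A')) c2 += 'a' - 'A';

        /* Store in order and stop right after the terminator is written. */
        *out++ = (char)c0;
        if (c0 == 0) break;
        *out++ = (char)c1;
        if (c1 == 0) break;
        *out++ = (char)c2;
    } while (c2 != 0);
}
```

The output side is safe: nothing is written after the terminating zero. Only the input side reads ahead, which is the price paid for grouping the loads.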
out RN 0 ; pointer to output string
in RN 1 ; pointer to input string
ca0 RN 2 ; character 0
t RN 3 ; scratch register
ca1 RN 12 ; character 1
ca2 RN 14 ; character 2
; void str_tolower_unrolled(char *out, char *in)
str_tolower_unrolled
STMFD sp!, {lr} ; function entry
loop_next3
LDRB ca0, [in], #1 ; ca0 = *in++;
LDRB ca1, [in], #1 ; ca1 = *in++;
LDRB ca2, [in], #1 ; ca2 = *in++;
SUB t, ca0, #'A' ; convert ca0 to lower case
CMP t, #'Z'-'A'
ADDLS ca0, ca0, #'a'-'A'
SUB t, ca1, #'A' ; convert ca1 to lower case
CMP t, #'Z'-'A'
ADDLS ca1, ca1, #'a'-'A'
SUB t, ca2, #'A' ; convert ca2 to lower case
CMP t, #'Z'-'A'
ADDLS ca2, ca2, #'a'-'A'
STRB ca0, [out], #1 ; *out++ = ca0;
TEQ ca0, #0 ; if (ca0!=0)
STRNEB ca1, [out], #1 ; *out++ = ca1;
TEQNE ca1, #0 ; if (ca0!=0 && ca1!=0)
STRNEB ca2, [out], #1 ; *out++ = ca2;
TEQNE ca2, #0 ; if (ca0!=0 && ca1!=0 && ca2!=0)
BNE loop_next3 ; goto loop_next3;
LDMFD sp!, {pc} ; return;
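The code above is the most efficient implementation we have tried so far. It needs only 7 clock cycles per character, which makes it 1.57 times as fast as the compiled C version.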
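The per-character figures quoted in this section can be reproduced by counting cycles in each loop body. The sketch below does that arithmetic in C. It assumes the rules from section 2 (three cycles for the taken loop branch, two stall cycles when a byte load result is used by the next instruction) plus one further assumption: every other instruction in the loop, including stores and conditional instructions whose condition fails, takes one cycle. Function entry and exit, and the single preload before the loop, are ignored. The structure and constant names are made up for illustration.

```c
/* Cycle accounting for one loop iteration (illustrative only). */
struct loop_cost {
    unsigned single_cycle_insns; /* instructions assumed to take 1 cycle */
    unsigned stall_cycles;       /* interlock stalls per iteration       */
    unsigned taken_branches;     /* each assumed to cost 3 cycles        */
    unsigned chars_per_iter;     /* characters handled per iteration     */
};

/* Compiler output: LDRB, SUB, CMP, ADDLS, STRB, CMP, two stalls, BNE. */
static const struct loop_cost COMPILED_LOOP = { 6, 2, 1, 1 };
/* Preloaded: SUB, CMP, ADDLS, STRB, TEQ, LDRNEB, no stall, BNE. */
static const struct loop_cost PRELOAD_LOOP  = { 6, 0, 1, 1 };
/* Unrolled: 3 LDRB, 9 convert, 3 STRB, 3 TEQ, no stall, BNE, 3 chars. */
static const struct loop_cost UNROLLED_LOOP = { 18, 0, 1, 3 };

double cycles_per_char(struct loop_cost l)
{
    unsigned cycles = l.single_cycle_insns + l.stall_cycles
                    + 3u * l.taken_branches;
    return (double)cycles / l.chars_per_iter;
}

/* Speedup of `fast` relative to `base`. */
double speedup(struct loop_cost base, struct loop_cost fast)
{
    return cycles_per_char(base) / cycles_per_char(fast);
}
```

Under these assumptions the three loops come out at 11, 9, and 7 cycles per character, and the ratios 11/9 and 11/7 round to the 1.22 and 1.57 quoted in the text.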