[Assembly Optimization] ARM32 Assembly Optimization


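  This article introduces NEON assembly optimization on the 32-bit ARM architecture and is suitable for readers of any background.
  Tip: when compiling for embedded devices (i.e. ARM boards), it is best to add -fsigned-char, because on ARM a plain char defaults to unsigned char rather than signed char. In addition, add the -c option when compiling the ARM assembly optimization code.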
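
To make the -fsigned-char point concrete, here is a minimal C sketch. The function names are made up for illustration and are not part of any library. The sketch assumes 8-bit bytes and a compiler (such as GCC or Clang) that wraps out-of-range values when converting to char.

```c
#include <assert.h>
#include <limits.h>

/* Interpret a raw byte as a signed 8-bit value, independent of compiler flags. */
int as_signed(unsigned char byte)
{
    return byte >= 128 ? (int)byte - 256 : (int)byte;
}

/* Interpret a raw byte as an unsigned 8-bit value, independent of compiler flags. */
int as_unsigned(unsigned char byte)
{
    return (int)byte;
}

/* Interpret a raw byte through plain char: the result depends on the target
   and on -fsigned-char / -funsigned-char. Converting a value above 127 to a
   signed char is implementation-defined; GCC and Clang wrap it. */
int as_plain_char(unsigned char byte)
{
    char c = (char)byte;
    return (int)c;
}

/* 1 if plain char is signed in this build, 0 if it is unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Usage example: the byte 0xF0 seen through each type. */
void check_char_views(void)
{
    assert(as_signed(0xF0) == -16);
    assert(as_unsigned(0xF0) == 240);
    assert(as_plain_char(0xF0) == (plain_char_is_signed() ? -16 : 240));
}
```

Building the same file with and without -fsigned-char on an ARM toolchain should flip the result of plain_char_is_signed(), which is a quick way to see whether a code base silently depends on the default.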

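1. Getting to know ARM assembly syntax

Pure ARM assembly has two syntaxes, armasm and GNU asm; this article uses the GNU asm syntax throughout.

1.1 Common syntax

(1) Defining a function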

	.text
	.align  4
	.global     name
	.type       name, %function
name:

     /* function body */

     bx lr	

(2) Defining a macro

   .macro  name arg1, arg2, arg3
		ldr        r0,            \arg1
		vst1.u32   {\arg2\()[0]},  [r0]
   .endm

(3) Printing variables inside a macro

   .macro  name arg1, arg2, arg3
		ldr        r0,            \arg1
		vst1.u32   {\arg2\()[0]},  [r0]
		.altmacro 
        .warning    "%(\arg1, \arg2, \arg3)"
	    .noaltmacro
   .endm

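This example shows that a macro parameter is referenced with a leading \, and that \() is used as a separator where the parameter name would otherwise run into the following text. Suppose arg2 is register d0: to store lane d0[0] to the memory address held in r0, do not write \arg2[0], because the assembler may try to parse arg2[0] as the parameter name; write \arg2\()[0] instead.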
  
 (4) Using .ltorg
  If the literal pool is too far from the code that references it and the current function needs to load one of its constants, add .ltorg after the end of the previous function and before the start of the current one; otherwise the assembler reports an error.

.ltorg Insert the literal pool of constants at this point in the program. The literal pool is used by the ldr = and adrl assembly language pseudo-instructions and is specific to the ARM. Using this assembler directive is almost always optional, as the GNU Assembler is smart enough to figure out when and where to put any literal pool. However, there are situations when it is very useful to include this directive, such as when you need absolute control over where the assembler places your code.

(5) Comments
 Tip: several comment styles exist, but to make it easier to port ARM32 optimized code to ARM64, prefer "//" or "/* */", because ARM64 assembly does not accept comments that start with @.

	 Inline comment char:  ‘@’
	 Line comment char: ‘#’
	 Statement separator: ‘;’

Alternatively, use /* */ for multi-line comments and // for single-line comments; note that // only works when the file has the .S suffix.

There are two basic comment styles: multi-line and single-line. Multi-line comments start with /* and everything is ignored until a matching sequence of */ is found. These comments are exactly the same as multi-line comments in C and C++. In ARM assembly, single line comments begin with an @ character, and all remaining characters on the line are ignored. Listing 2.1 shows both types of comment. In addition, if the name of the file ends in .S , then single line comments can begin with // . If the file name does not end with a capital .S , then the // syntax is not allowed.

GNU syntax overview: https://www.eecs.umich.edu/courses/eecs373/readings/Assembler.pdf
GNU syntax quick start: http://www.ic.unicamp.br/~celio/mc404-2014/docs/gnu-arm-directives.pdf
GNU syntax quick reference: http://www.coranac.com/files/gba/re-ejected-gasref.pdf

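2. Registers of the 32-bit ARM architecture

2.1 ARM registers

ARM has sixteen 32-bit general-purpose registers (R0-R15); the register list is shown in Figure 3-5. Points to note: R14 (LR) holds the return address when a subroutine is called; R0-R3 are used to pass function arguments; any other register used in the callee (R4-R11) must be pushed first and restored before returning. R12 is special: the callee may use it without pushing it. For the detailed calling rules see ATPCS, whose current version AAPCS is reference 2 (5.1.1 Core registers); ATPCS uses a full descending stack (STMFD/LDMFD).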

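Reference 1: http://home.deib.polimi.it/agosta/lib/exe/fetch.php?id=teaching%3Ainfo1tlc&cache=cache&media=teaching:asm_guide.pdf
Reference 2: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf
Note: floating-point arguments are passed in s0-s15 (d0-d7, q0-q3); this applies to the hard-float (VFP) variant of the calling convention.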

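2.2 NEON registers

NEON was first implemented in the ARM Cortex-A8 processor, on the ARMv7 architecture (the ARMv7-A and ARMv7-R profiles). There are sixteen 128-bit Q registers and thirty-two 64-bit D registers (from reference 1, 5.1.1 Core registers); the register list is shown in Figure A2-1 (from reference 2, A2.6.1 Advanced SIMD and VFP extension registers). Note that these are views of the same storage: S0 is the low 32 bits of D0 and S1 its high 32 bits; likewise D0 is the low 64 bits of Q0 and D1 its high 64 bits. The relationship between the S, D and Q registers is: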
  The mapping between the registers is as follows:
     • S<2n> maps to the least significant half of D<n>
     • S<2n+1> maps to the most significant half of D<n>
     • D<2n> maps to the least significant half of Q<n>
     • D<2n+1> maps to the most significant half of Q<n>.
  For example, you can access the least significant half of the elements of a vector in Q6 by referring to D12,
and the most significant half of the elements by referring to D13.
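
The mapping can be checked with a small model in C. This is only a sketch: the neon_bank type and the read/write helpers are hypothetical names invented here, and the model treats the register bank as 64 words of 32 bits rather than touching real hardware registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the shared NEON/VFP register bank:
   16 Q registers = 32 D registers = 64 words of 32 bits. */
typedef struct {
    uint32_t word[64];
} neon_bank;

/* S<n> is word n of the bank. Only S0-S31 exist; they overlap D0-D15. */
uint32_t read_s(const neon_bank *b, unsigned n)
{
    assert(n < 32);
    return b->word[n];
}

/* D<n> occupies words 2n (low half) and 2n+1 (high half). */
uint64_t read_d(const neon_bank *b, unsigned n)
{
    assert(n < 32);
    return (uint64_t)b->word[2 * n] | ((uint64_t)b->word[2 * n + 1] << 32);
}

void write_d(neon_bank *b, unsigned n, uint64_t value)
{
    assert(n < 32);
    b->word[2 * n]     = (uint32_t)value;
    b->word[2 * n + 1] = (uint32_t)(value >> 32);
}

/* Q<n> is made of D<2n> (low half) and D<2n+1> (high half). */
void read_q(const neon_bank *b, unsigned n, uint64_t *low, uint64_t *high)
{
    assert(n < 16);
    *low  = read_d(b, 2 * n);
    *high = read_d(b, 2 * n + 1);
}
```

Writing D12 and D13 and then reading Q6 through read_q returns the two values as the low and high halves, which matches the Q6 example in the quoted text.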

Reference 1: http://infocenter.arm.com/help/topic/com.arm.doc.dht0002a/DHT0002A_introducing_neon.pdf
Reference 2: http://vision.gel.ulaval.ca/~jflalonde/cours/1001/h17/docs/ARM_v7.pdf

Note: d8-d15 (q4-q7) must be saved on the stack if they are used in a subroutine. Reference: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042f/IHI0042F_aapcs.pdf 5.1.2.1 VFP register usage conventions (VFP v2, v3 and the Advanced SIMD Extension)

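2.3 NEON instruction set
2.3.1 ARMv7/AArch32 instruction format

Every NEON instruction mnemonic starts with the letter V. Taking 32-bit instructions as the example, the general format is as follows (see reference 1, Armv7-A/AArch32 instruction syntax):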

V{<mod>}<op>{<shape>}{<cond>}{.<dt>} {<dest>}, <src1>, <src2>
  • <mod>:

    • Q: The instruction uses saturating arithmetic, so that the result is saturated within the range of the specified data type, such as VQABS, VQSHL etc.
    • H: The instruction will halve the result. It does this by shifting right by one place (effectively a divide by two with truncation), such as VHADD, VHSUB.
    • D: The instruction doubles the result, such as VQDMULL, VQDMLAL, VQDMLSL and VQ{R}DMULH.
    • R: The instruction will perform rounding on the result, equivalent to adding 0.5 to the result before truncating, such as VRHADD, VRSHR.
  • <op>: the operation (for example, ADD, SUB, MUL).
  • <cond>: Condition, used with IT instruction.
  • <dt>: Data type, such as s8, u8, f32 etc.
  • <dest>: Destination.
  • <src1>: Source operand 1.
  • <src2>: Source operand 2.
  • <shape>: Shape, i.e. the NEON data-processing type: Long (L), Wide (W) or Narrow (N).
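
The effect of the Q, H, D and R modifiers is easier to see on a single lane. The following C sketch models one lane of a few of the instructions named above; the function names are hypothetical, and the code describes the arithmetic only, not how a real NEON unit is implemented. The right shift in qdmulh_s16 assumes the compiler shifts negative values arithmetically, which GCC and Clang do.

```c
#include <assert.h>
#include <stdint.h>

/* Q: saturating add of one signed 8-bit lane (models VQADD.S8). */
int8_t qadd_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) sum = INT8_MAX;
    if (sum < INT8_MIN) sum = INT8_MIN;
    return (int8_t)sum;
}

/* H: halving add of one unsigned 8-bit lane (models VHADD.U8). */
uint8_t hadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b) >> 1);
}

/* R + H: rounding halving add (models VRHADD.U8). */
uint8_t rhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Q + D: saturating doubling multiply returning the high half
   (models VQDMULH.S16). */
int16_t qdmulh_s16(int16_t a, int16_t b)
{
    int64_t doubled = 2 * ((int64_t)a * (int64_t)b);
    if (doubled > INT32_MAX) doubled = INT32_MAX;  /* only -32768 * -32768 */
    return (int16_t)(doubled >> 16);               /* assumes arithmetic shift */
}

/* Usage example with values chosen to hit each special case. */
void check_modifiers(void)
{
    assert(qadd_s8(100, 100) == 127);    /* would wrap to -56 without saturation */
    assert(hadd_u8(255, 254) == 254);    /* (255 + 254) >> 1, truncated */
    assert(rhadd_u8(255, 254) == 255);   /* (255 + 254 + 1) >> 1, rounded */
    assert(qdmulh_s16(16384, 16384) == 8192);
    assert(qdmulh_s16(-32768, -32768) == 32767);
}
```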

NEON data-processing instructions fall into four types: Normal, Long, Wide and Narrow:
- Normal instructions can operate on any vector types, and produce result vectors the same size, and usually the same type, as the operand vectors.
- Long instructions operate on doubleword vector operands and produce a quadword vector result. The result elements are usually twice the width of the operands, and of the same type. Long instructions are specified using an L appended to the instruction.
- Wide instructions operate on a doubleword vector operand and a quadword vector operand, producing a quadword vector result. The result elements and the first operand are twice the width of the second operand elements. Wide instructions have a W appended to the instruction.
- Narrow instructions operate on quadword vector operands, and produce a doubleword vector result. The result elements are usually half the width of the operand elements. Narrow instructions are specified using an N appended to the instruction.
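
As a rough illustration of the Long, Wide and Narrow shapes, the C sketch below models VADDL.U8, VADDW.U8 and VMOVN.I16 lane by lane. The helper names and the LANES constant are assumptions made for this example; a D register is modelled as 8 lanes of 8 bits and a Q register as 8 lanes of 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* a D register holds 8 lanes of 8 bits */

/* Long: D + D -> Q, result lanes are twice as wide (models VADDL.U8). */
void addl_u8(uint16_t out[LANES], const uint8_t a[LANES], const uint8_t b[LANES])
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < LANES; i++)
        out[i] = (uint16_t)(a[i] + b[i]);
}

/* Wide: Q + D -> Q, the second operand is widened first (models VADDW.U8). */
void addw_u8(uint16_t out[LANES], const uint16_t a[LANES], const uint8_t b[LANES])
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < LANES; i++)
        out[i] = (uint16_t)(a[i] + b[i]);  /* wraps modulo 2^16 */
}

/* Narrow: Q -> D, keeps the low half of every lane (models VMOVN.I16). */
void movn_u16(uint8_t out[LANES], const uint16_t a[LANES])
{
    assert(out != NULL && a != NULL);
    for (size_t i = 0; i < LANES; i++)
        out[i] = (uint8_t)(a[i] & 0xFFu);
}
```

The long form is the usual way to add 8-bit pixels without overflow: 255 + 255 fits in the 16-bit result lane, whereas a normal 8-bit add would wrap.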

Reference 1: https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference

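3. Instruction manuals for the 32-bit ARM architecture

3.1 Manuals
3.1.1 Chinese manual

Download: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0013d/index.html

   Reading advice: read chapters 2 and 3 through once to get a grip on the ARM basics. When actually writing assembly optimizations the Chinese manual is usually sufficient, but for anything that remains unclear, look up the more detailed explanation in "3.1.2 English manual".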

3.1.2 English manual

Download (registration required): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0406c/
 Download (no registration required): https://static.docs.arm.com/ddi0406/cd/DDI0406C_d_armv7ar_arm.pdf

3.1.3 Programmer’s Guide

Download: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html

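4. NEON optimization techniques

  1. Skill 1: Reduce data dependencies

On the ARMv7-A platform, to reduce instruction latency, avoid using the destination register of the current instruction as a source register of the next instruction. Original text: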

On the ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To reduce instruction latency, it’s better to avoid using the destination register of current instruction as the source register of next instruction.
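
The same idea can be shown in scalar C as an analogy. In the sketch below (function names are hypothetical), the first loop feeds every addition with the result of the previous one, while the second keeps four independent accumulators so that consecutive additions do not wait on each other. A compiler is free to rearrange either loop, so this only illustrates the dependency pattern; in hand-written NEON assembly the equivalent is to interleave instructions that work on different registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add. */
uint32_t sum_chained(const uint8_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
}

/* Four accumulators: consecutive adds write different variables,
   so no add needs the result of the one just before it. */
uint32_t sum_interleaved(const uint8_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    uint32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += src[i];
        acc1 += src[i + 1];
        acc2 += src[i + 2];
        acc3 += src[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        acc0 += src[i];
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Both functions return the same sum; only the length of the dependency chain differs.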

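  2. Skill 2: Reduce branches

The NEON instruction set has no branch instructions; when assembly code needs to branch, the ARM branch instructions are used. Branch prediction is used widely in ARM processors, but a mispredicted branch is expensive, so use as few branch instructions as possible in optimized assembly. Original text: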

There isn’t branch jump instruction in NEON instruction set. When the branch jump is needed, jump instructions of ARM are used. In ARM processors, branch prediction techniques are widely used. But once the branch prediction fails, the punishment is rather high. So it’s better to avoid the using jump instructions. In fact, logical operations can be used to replace branch in some cases.
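
The remark that logical operations can replace a branch can be sketched in C on a single 8-bit lane. The function names are made up for this example. The branch-free version builds an all-ones or all-zeros mask from the comparison, which is what NEON's VCGT produces per lane, and then selects bits with it, which is what VBSL does.

```c
#include <assert.h>
#include <stdint.h>

/* Branching version, kept as the reference. */
uint8_t max_branch(uint8_t a, uint8_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branch-free version: comparison -> mask -> bitwise select. */
uint8_t max_select(uint8_t a, uint8_t b)
{
    uint8_t mask = (uint8_t)-(a > b);  /* 0xFF if a > b, else 0x00 */
    return (uint8_t)((a & mask) | (b & (uint8_t)~mask));
}

/* Usage example: compare the two versions over every input pair. */
void check_max_select(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            assert(max_select((uint8_t)a, (uint8_t)b) ==
                   max_branch((uint8_t)a, (uint8_t)b));
}
```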

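  3. Skill 3: Use the preload instruction PLD
       
       ARM is a load/store architecture: apart from load and store instructions, every operation works on registers, so making loads and stores more efficient is important when optimizing a program.
       The preload instruction lets the processor signal the memory system that data at a given address is likely to be loaded in the near future. If the data has been preloaded into the cache correctly, the cache hit rate goes up and so does performance; a bad preload, however, reduces performance. Original text: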

ARM processors are a load/store system. Except load/store instructions, all operations perform on registers. Therefore increasing the efficiency of load/store instructions is very important for optimizing application.
  Preload instruction allows the processor to signal the memory system that a data load from an address is likely in the near future. If the data is preloaded into cache correctly, it would be helpful to improve the rate of cache hit which can boost performance significantly. But the preload is not a panacea. It’s very hard to use on recent processors and it can be harmful too. A bad preload will reduce performance.
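
From C, a preload hint can be issued with the GCC/Clang builtin __builtin_prefetch, which these compilers lower to PLD on ARMv7. The sketch below is an assumption-laden example rather than a recommendation: the function name, the 64-byte block size and the 256-byte prefetch distance are all hypothetical values that would have to be tuned by measurement on the target core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning values; measure on the target before relying on them. */
#define BLOCK_BYTES    64
#define PREFETCH_AHEAD 256

uint32_t sum_with_prefetch(const uint8_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* Once per block, hint that data further ahead will be read soon.
           Arguments: address, 0 = read access, 0 = low temporal locality. */
        if ((i % BLOCK_BYTES) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 0);
#endif
        acc += src[i];
    }
    return acc;
}
```

The hint never changes the result, only the timing, so the function can be compared against a plain loop both for correctness and in a benchmark to decide whether the preload helps or hurts.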

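  4. Skill 4: Misc
   
   In ARM NEON programming, different instruction sequences can implement the same operation, but fewer instructions do not always mean better performance. The choice has to be based on benchmark and profiling results for the specific case; one such case from practice follows.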

Floating-point VMLA/VMLS instruction

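Usually a VMUL+VADD or VMUL+VSUB pair can be replaced by a single VMLA or VMLS, which needs fewer instructions. However, floating-point VMLA/VMLS has a longer latency than floating-point VMUL; if no other instructions can be scheduled into that latency, the floating-point VMUL+VADD or VMUL+VSUB sequence will perform better.

Reference 1: https://community.arm.com/android-community/b/android/posts/arm-neon-optimization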

5. Debugging optimized code

Add the following to the assembly code (i.e. in the .S file):

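.macro print_m in1=r0, in2=d0
       push {r0-r3, r12, lr}      // even register count keeps sp 8-byte aligned
	   vst1.u64       {\in2\()}, [\in1\()]
	   mov     r0, \in1
       bl cprintf
       pop {r0-r3, r12, lr}       // restore lr; popping into pc would return from the enclosing function
.endm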
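  Note: in1 must be an ARM register that holds a memory address, and in2 a NEON register such as D0. The call may clobber the caller-saved NEON registers (d0-d7, d16-d31), so use the macro for debugging only.
  Add the following to the C file: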
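#include <stdio.h>

/* Prints the 16 bytes at srcu8, first as unsigned and then as signed values.
   16 bytes cover a Q register; after dumping a D register only the first 8 are fresh. */
void cprintf(unsigned char *srcu8)
{
  int i=0;
  signed char *srcs8 = (signed char *)srcu8;
  for(i=0; i < 16; i++){
       printf("%d ", srcu8[i]);
  }
  for(i=0; i < 16; i++){
      printf("%d ", srcs8[i]);
  }
  printf("\n");
}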

Reference: https://people.cs.clemson.edu/~rlowe/cs2310/notes/debugging-with-printf.pdf
