Neon intrinsics

1. Introduction

            The previous post introduced Arm NEON; this one covers the NEON intrinsics, i.e. the C-level interface one step above hand-written assembly. NEON is the SIMD instruction set introduced with the Armv7 architecture, which provides 16 128-bit registers. On the current Arm64 (AArch64) architecture the register count grows to 32, but the maximum width is still 128 bits, so the programming model is essentially unchanged. Because each register can hold and process several data elements at once, these are called vector registers. Intrinsics let you drive the NEON registers from C; compared with pure assembly they are more readable and faster to develop with. To call NEON intrinsics in your code, include the header "arm_neon.h". For the complete function list, see the official ARM NEON intrinsics reference. Below is a summary of the NEON intrinsics collected from the web:

1.1 Instruction classes

  • Normal instructions: produce a result vector of the same size, and usually the same type, as the operand vectors
  • Long instructions: operate on doubleword vector operands and produce a quadword result; the result elements are generally twice the width of the operand elements
  • Wide instructions: operate on one doubleword and one quadword vector operand and produce a quadword result; the result elements and the first operand's elements are twice the width of the second operand's elements
  • Narrow instructions: operate on quadword vector operands and produce a doubleword result; the result elements are generally half the width of the operand elements
  • Saturating instructions: results that exceed the range of the data type are automatically clamped to that range

Example 1:

  • int16x8_t vqaddq_s16 (int16x8_t, int16x8_t)
  • int16x4_t vqadd_s16 (int16x4_t, int16x4_t)
  1. The first letter 'v' marks a vector instruction, i.e. a NEON instruction;
  2. The second letter 'q' marks a saturating instruction: the result of the addition saturates automatically;
  3. The third field 'add' names the operation, here an addition;
  4. The fourth field 'q', when present, selects the register width: with 'q' the instruction operates on a quadword (Q) register of 128 bits; without it, on a doubleword (D) register of 64 bits;
  5. The fifth field 's16' gives the element type, a signed 16-bit integer with range -32768 to 32767;
  6. Parameter and return types follow the usual C conventions.

     Other mnemonics you may encounter include:

  • l: long instruction, widens the data
  • w: wide instruction, widens the second (narrower) operand
  • n: narrow instruction, narrows the data

     For a categorised list of all the functions, see this blog post: https://blog.csdn.net/hemmingway/article/details/44828303

1.2 Data types

   The integer data types built into NEON intrinsics mainly include the following:

  • (u)int8x8_t;
  • (u)int8x16_t;
  • (u)int16x4_t;
  • (u)int16x8_t;
  • (u)int32x2_t;
  • (u)int32x4_t;
  • (u)int64x1_t;

       Here the first number is the element width in bits (8/16/32/64), and the second number is how many elements of that type fit in one register. For example, int16x8_t holds 8 signed 16-bit integers in a single register.

2. Syntax

2.1 Arithmetic

  • add: vaddq_f32 or vaddq_f64 (  sum = v1 + v2 )
float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 };
float32x4_t sum = vaddq_f32(v1, v2);
// => sum = { 2.0, 3.0, 4.0, 5.0 }
  • multiply: vmulq_f32 or vmulq_f64 ( prod = v1 * v2 )
float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 };
float32x4_t prod = vmulq_f32(v1, v2);
// => prod = { 1.0, 2.0, 3.0, 4.0 }
  • multiply and accumulate: vmlaq_f32 (  sum = v3 + v1 * v2 )
float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 2.0, 2.0, 2.0, 2.0 }, v3 = { 3.0, 3.0, 3.0, 3.0 };
float32x4_t acc = vmlaq_f32(v3, v1, v2);  // acc = v3 + v1 * v2
// => acc = { 5.0, 7.0, 9.0, 11.0 }
  • multiply by a scalar: vmulq_n_f32 or vmulq_n_f64 ( prod = v * s )
float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float32_t s = 3.0;
float32x4_t prod = vmulq_n_f32(v, s);
// => prod = { 3.0, 6.0, 9.0, 12.0 }
  • multiply by a scalar and accumulate: vmlaq_n_f32 ( acc = v1 + v2 * s )
float32x4_t v1 = { 1.0, 2.0, 3.0, 4.0 }, v2 = { 1.0, 1.0, 1.0, 1.0 };
float32_t s = 3.0;
float32x4_t acc = vmlaq_n_f32(v1, v2, s);
// => acc = { 4.0, 5.0, 6.0, 7.0 }
  • reciprocal estimate (needed for division): vrecpeq_f32 or vrecpeq_f64 ( reciprocal ~ 1 / v )
float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float32x4_t reciprocal = vrecpeq_f32(v);
// => reciprocal = { 0.998046875, 0.499023438, 0.333007813, 0.249511719 }
One Newton-Raphson refinement step with vrecpsq_f32 sharpens the coarse estimate:
float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float32x4_t reciprocal = vrecpeq_f32(v);
float32x4_t inverse = vmulq_f32(vrecpsq_f32(v, reciprocal), reciprocal);
// => inverse = { 0.999996185, 0.499998093, 0.333333015, 0.249999046 }

2.2 Load

  • load vector: vld1q_f32 or vld1q_f64
float values[5] = { 1.0, 2.0, 3.0, 4.0, 5.0 };
float32x4_t v = vld1q_f32(values);
// => v = { 1.0, 2.0, 3.0, 4.0 }
  • load same value for all lanes: vld1q_dup_f32 or vld1q_dup_f64
float val = 3.0;
float32x4_t v = vld1q_dup_f32(&val);
// => v = { 3.0, 3.0, 3.0, 3.0 }
  • set all lanes to a hardcoded value: vmovq_n_f16 or vmovq_n_f32 or vmovq_n_f64
float32x4_t v = vmovq_n_f32(1.5);
// => v = { 1.5, 1.5, 1.5, 1.5 }

2.3 Store

  • store vector: vst1q_f32 or vst1q_f64
float32x4_t v = { 1.0, 2.0, 3.0, 4.0 };
float values[5];
vst1q_f32(values, v);
// => values = { 1.0, 2.0, 3.0, 4.0, <untouched> }  (only 4 lanes are stored)
  • store one lane of an array of vectors: vst4q_lane_f16 or vst4q_lane_f32 or vst4q_lane_f64 (use vst1... / vst2... / vst3... for other array lengths)
float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }, v1 = { 5.0, 6.0, 7.0, 8.0 }, v2 = { 9.0, 10.0, 11.0, 12.0 }, v3 = { 13.0, 14.0, 15.0, 16.0 };
float32x4x4_t u = { v0, v1, v2, v3 };
float buff[4];
vst4q_lane_f32(buff, u, 0);
// => buff = { 1.0, 5.0, 9.0, 13.0 }

2.4 Arrays

  • access to values: val[n]
float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 }, v1 = { 5.0, 6.0, 7.0, 8.0 }, v2 = { 9.0, 10.0, 11.0, 12.0 }, v3 = { 13.0, 14.0, 15.0, 16.0 };
float32x4x4_t ary = { v0, v1, v2, v3 };
float32x4_t v = ary.val[2];
// => v = { 9.0, 10.0, 11.0, 12.0 }

2.5 Max and min

  • max of two vectors, element by element: vmaxq_f32
float32x4_t v0 = { 5.0, 2.0, 3.0, 4.0 }, v1 = { 1.0, 6.0, 7.0, 8.0 };
float32x4_t v2 = vmaxq_f32(v0, v1);
// => v2 = { 5.0, 6.0, 7.0, 8.0 }
  • max of vector elements, using folding maximum:
float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 };
float32x2_t maxOfHalfs = vpmax_f32(vget_low_f32(v0), vget_high_f32(v0));
float32x2_t maxOfMaxOfHalfs = vpmax_f32(maxOfHalfs, maxOfHalfs);
float maxValue = vget_lane_f32(maxOfMaxOfHalfs, 0);
// => maxValue = 4.0
  • min of two vectors, element by element: vminq_f32
float32x4_t v0 = { 5.0, 2.0, 3.0, 4.0 }, v1 = { 1.0, 6.0, 7.0, 8.0 };
float32x4_t v2 = vminq_f32(v0, v1);
// => v2 = { 1.0, 2.0, 3.0, 4.0 }
  • min of vector elements, using folding minimum:
float32x4_t v0 = { 1.0, 2.0, 3.0, 4.0 };
float32x2_t minOfHalfs = vpmin_f32(vget_low_f32(v0), vget_high_f32(v0));
float32x2_t minOfMinOfHalfs = vpmin_f32(minOfHalfs, minOfHalfs);
float minValue = vget_lane_f32(minOfMinOfHalfs, 0);
// => minValue = 1.0

2.6 Conditionals

  • ternary operator: use vector comparison (for example vcltq_f32 for less than comparison)
float32x4_t v1 = { 1.0, 0.0, 1.0, 0.0 }, v2 = { 0.0, 1.0, 1.0, 0.0 };
uint32x4_t mask = vcltq_f32(v1, v2);  // lanes where v1 < v2 become all ones, others all zeros
float32x4_t ones = vmovq_n_f32(1.0), twos = vmovq_n_f32(2.0);
float32x4_t v3 = vbslq_f32(mask, ones, twos);  // selects from the first operand where the mask bits are 1, from the second where they are 0
// => v3 = { 2.0, 1.0, 2.0, 2.0 }

3. Example: OpenCV vs. C pointers vs. NEON

       Setup: a 640 x 480 image. Task: each output pixel is the column-wise sum of the current source row and the two rows above it, i.e. x(i,j) = x(i,j) + x(i-1,j) + x(i-2,j).

  • The OpenCV approach, where cv::Mat gray, src; src holds each incoming frame (640x480, 8-bit depth):
GETTIME(&lTimeStart);
for (int col = 0; col < gray.cols; col++)
{
		gray.at<uchar>(0, col) = src.at<uchar>(0, col);
}
for (int col = 0; col < gray.cols; col++)
{
		gray.at<uchar>(1, col) = gray.at<uchar>(0, col) + src.at<uchar>(1, col);
}
for (int col = 0; col < gray.cols; col++)
{
		gray.at<uchar>(2, col) = gray.at<uchar>(1, col) + src.at<uchar>(2, col);
}
for (int row = 3; row < gray.rows; row++)
{
		for (int col = 0; col < gray.cols; col++)
		{
		   gray.at<uchar>(row, col) = gray.at<uchar>(row - 1, col) + src.at<uchar>(row, col) - src.at<uchar>(row - 3, col);
		}
}
GETTIME(&lTimeEnd);
printf("time %ldus\n",lTimeEnd - lTimeStart);
On the Arm Cortex-A57 platform, the OpenCV version takes on average: time = 19175us
  • The C-pointer approach:
GETTIME(&lTimeStart);
unsigned char *ptr = src.ptr(0);
unsigned char *grayPtr = gray.ptr(0);
for(int col = 0; col < gray.cols; col++)
{
      grayPtr[col] = ptr[col];
}
unsigned char *ptr1 = src.ptr(1);
unsigned char *grayPtr1 = gray.ptr(1);
for(int col =0; col < gray.cols; col++)
{
       grayPtr1[col] = ptr[col] + ptr1[col];//34us
}
unsigned char *ptr2 = NULL;
unsigned char *grayPtr2 = NULL;
for(int row = 2; row < gray.rows; row++)
{
       ptr = src.ptr(row - 2);
       ptr1 = src.ptr(row -1);
       ptr2 = src.ptr(row);
       grayPtr2 = gray.ptr(row);

       for(int col = 0; col < gray.cols; col++){
          grayPtr2[col] = ptr[col] + ptr1[col] + ptr2[col];
       }
}
GETTIME(&lTimeEnd);
printf("time %ldus\n",lTimeEnd - lTimeStart);
On the Arm Cortex-A57 platform, the C-pointer version takes on average: time = 11252us
  • The NEON approach:
GETTIME(&lTimeStart);
unsigned char *ptr = src.ptr(0);
unsigned char *grayPtr = gray.ptr(0);
for(int col = 0; col < gray.cols; col++)
{
     grayPtr[col] = ptr[col];
}
unsigned char *ptr1 = src.ptr(1);
unsigned char *grayPtr1 = gray.ptr(1);
for(int col =0; col < gray.cols; col++)
{
     grayPtr1[col] = ptr[col] + ptr1[col];//34us
}
unsigned char *ptr2 = NULL;
unsigned char *grayPtr2 = NULL;
for(int row = 2; row < gray.rows; row++)
{
     ptr = src.ptr(row - 2);
     ptr1 = src.ptr(row -1);
     ptr2 = src.ptr(row);
     grayPtr2 = gray.ptr(row);

     for(int col = 0; col < gray.cols; col += 16){
          uint8x16_t in1, in2, in3, out;
          in1 = vld1q_u8(ptr + col);
          in2 = vld1q_u8(ptr1 + col);
          in3 = vld1q_u8(ptr2 + col);
          out = vaddq_u8(in1, in2);
          out = vaddq_u8(in3, out);
          vst1q_u8(grayPtr2 + col, out);
     }
}
GETTIME(&lTimeEnd);
printf("time %ldus\n",lTimeEnd - lTimeStart);
On the Arm Cortex-A57 platform, the NEON intrinsics version takes on average: time = 1907us

       In summary, the NEON version is roughly 10x faster than the OpenCV approach. (Note that all of the additions here can overflow; because of the particulars of this algorithm, no overflow handling was applied.)