NCNN: Accelerating Convolutions on Mobile Phones

From C++ to Android

ncnn is written in C++, and I must admit I had never done any Android development. Fortunately, the project ships a CMake toolchain file for the Android NDK dependency chain.

	Android CMake toolchain file, for use with the Android NDK r5-r10d 
	Requires cmake 2.6.3 or newer (2.8.9 or newer is recommended).
	See home page: https://github.com/taka-no-me/android-cmake

	## Usage Linux:
	$ export ANDROID_NDK=/absolute/path/to/the/android-ndk
	$ mkdir build && cd build
	$ cmake -DCMAKE_TOOLCHAIN_FILE=path/to/the/android.toolchain.cmake ..
	$ make -j8

	## Usage Windows:
	You need native port of make to build your project.
	Android NDK r7 (and newer) already has make.exe on board.
	For older NDK you have to install it separately.
	For example, this one: http://gnuwin32.sourceforge.net/packages/make.htm

	$ SET ANDROID_NDK=C:\absolute\path\to\the\android-ndk
	$ mkdir build && cd build
	$ cmake.exe -G"MinGW Makefiles"
	-DCMAKE_TOOLCHAIN_FILE=path\to\the\android.toolchain.cmake
	-DCMAKE_MAKE_PROGRAM="%ANDROID_NDK%\prebuilt\windows\bin\make.exe" ..
	$ cmake.exe --build .

I was quite pleased to find it; it shows once again how valuable comments are. Following those hints, I studied android-cmake a bit:

cmake -DCMAKE_TOOLCHAIN_FILE=android.toolchain.cmake \
      -DANDROID_NDK=<ndk_path>                       \
      -DCMAKE_BUILD_TYPE=Release                     \
      -DANDROID_ABI="armeabi-v7a with NEON"          \
      <source_path>
cmake --build .

The biggest difference from building directly on x86 is that you additionally have to point CMake at the toolchain configuration file android.toolchain.cmake and at the toolchain's location with -DANDROID_NDK=<ndk_path>; everything else works much like an ordinary CMakeLists.txt build. Once again, CMake makes this remarkably painless. The option -DANDROID_ABI="armeabi-v7a with NEON" specifies properties and optimization settings of the build target; the details are in android.toolchain.cmake itself, with explanations in its comments.

Hands-on: building the Android lib from C++

A high-level look at NCNN

Techniques involved

  1. OpenMP

    #pragma omp parallel for
    for (int p=0; p<outch; p++)
    {
      ...
    }

  2. SIMD (a small NEON illustration follows this list)
  3. loop unrolling
  4. basic CNN concepts
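
As a taste of the SIMD side (the ARM layer implementations use NEON intrinsics), here is a minimal, illustrative NEON snippet that multiply-accumulates four floats per iteration. It is not ncnn code, just the basic pattern the optimized layers build on.

#include <arm_neon.h>

// out[i] += a[i] * b[i], four lanes at a time (n assumed to be a multiple of 4)
static void madd_neon(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vo = vld1q_f32(out + i);
        vo = vmlaq_f32(vo, va, vb);   // vo += va * vb
        vst1q_f32(out + i, vo);
    }
}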

NCNN mainly implements accelerated forward and backward computation on the CPU. The convolution is computed directly, not via im2col or FFT; that is, the direct convolution loop itself is implemented and then optimized.

Analysis of the convolution computation

--src
	|--blob.cpp   
	|--blob.h
	|--cpu.cpp
	|--cpu.h
	|--layer
	|--layer.cpp
	|--layer.h
	|--mat.cpp
	|--mat.h
	|--mat_pixel.cpp
	|--net.cpp
	|--net.h
	|--opencv.cpp
	|--opencv.h
	|--platform.h.in

cpu.h: functions for querying the number of CPUs and their frequency, setting power-save mode, and adjusting these settings dynamically.
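
As a rough usage sketch (I am going from my reading of cpu.h, so check the header for the exact names and signatures), the interface is used along these lines:

#include <cstdio>
#include "cpu.h"

int main()
{
    // hypothetical usage based on cpu.h
    printf("cpu count: %d\n", ncnn::get_cpu_count()); // number of cores the device reports
    ncnn::set_cpu_powersave(1);                        // prefer the low-power cores, if supported
    return 0;
}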

Mat

mat.h: the core data structure.

It carries both parameters and image data. If you want to understand the memory management, mat.cpp is well worth a careful read; understanding this data structure helps a lot when following the optimizations later.
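
To fix ideas, here is a heavily stripped-down conceptual sketch of such a structure. It is not ncnn's actual definition (mat.h has the real one, which also aligns each channel with a separate step), but it captures the idea that data is stored channel by channel and that channel(q) returns a view into channel q.

// conceptual sketch only; field names loosely modelled on mat.h
struct MiniMat
{
    int w, h, c;       // width, height, channels
    float* data;       // one contiguous buffer, channels laid out back to back
    int* refcount;     // shared reference count for cheap copies

    // a view of channel q: an h x w plane starting at data + q*w*h
    // (the real Mat uses an aligned per-channel stride instead of w*h)
    float* channel(int q) const { return data + (size_t)q * w * h; }
};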

// exchange-add operation for atomic operations on reference counters
#if defined __INTEL_COMPILER && !(defined WIN32 || defined _WIN32)
// atomic increment on the linux version of the Intel(tm) compiler
#  define NCNN_XADD(addr, delta) (int)_InterlockedExchangeAdd(const_cast<void*>(reinterpret_cast<volatile void*>(addr)), delta)
#elif defined __GNUC__
#  if defined __clang__ && __clang_major__ >= 3 && !defined __ANDROID__ && !defined __EMSCRIPTEN__ && !defined(__CUDACC__)
#    ifdef __ATOMIC_ACQ_REL
#      define NCNN_XADD(addr, delta) __c11_atomic_fetch_add((_Atomic(int)*)(addr), delta, __ATOMIC_ACQ_REL)
#    else
#      define NCNN_XADD(addr, delta) __atomic_fetch_add((_Atomic(int)*)(addr), delta, 4)
#    endif
#  else
#    if defined __ATOMIC_ACQ_REL && !defined __clang__
// version for gcc >= 4.7
#      define NCNN_XADD(addr, delta) (int)__atomic_fetch_add((unsigned*)(addr), (unsigned)(delta), __ATOMIC_ACQ_REL)
#    else
#      define NCNN_XADD(addr, delta) (int)__sync_fetch_and_add((unsigned*)(addr), (unsigned)(delta))
#    endif
#  endif
#elif defined _MSC_VER && !defined RC_INVOKED
#  include <intrin.h>
#  define NCNN_XADD(addr, delta) (int)_InterlockedExchangeAdd((long volatile*)addr, delta)
#else
static inline int NCNN_XADD(int* addr, int delta) { int tmp = *addr; *addr += delta; return tmp; }
#endif

These are compiler-specific intrinsics, used here to implement an atomic fetch-and-add. After all, the code will run multi-threaded, and a broken reference count would leak memory.
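
A hedged sketch of how such a counter is typically used, modelled on the addref/release pattern in mat.h (names and details approximate):

// sketch: reference counting with NCNN_XADD
static void addref(int* refcount)
{
    if (refcount)
        NCNN_XADD(refcount, 1);                    // atomically bump the count
}

static void release(int* refcount, void* data)
{
    if (refcount && NCNN_XADD(refcount, -1) == 1)  // we were the last owner
        fastFree(data);                            // safe to free the buffer (fastFree is shown later)
}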

mat_pixel.cpp:

It implements the conversion between Mat and raw pixel data, written entirely by hand with no dependency on OpenCV at all. So how do images actually get in?

How images are fed in

opencv.h and opencv.cpp implement a stripped-down image I/O facility, which is disabled by default: option(NCNN_OPENCV "minimal opencv structure emulation" OFF). I should try it out, because the examples do not use this built-in I/O; they call OpenCV's image-reading interface directly and then convert the data with ncnn::Mat::from_pixels_resize(...). Given how many image formats exist, it is worth running it myself to confirm.
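
The pattern used in the examples looks roughly like this (a sketch based on the squeezenet example; the exact pixel-type enum and input size depend on the model):

#include <opencv2/opencv.hpp>
#include "mat.h"

// decode with OpenCV, then convert and resize into an ncnn::Mat
ncnn::Mat load_input(const char* path)
{
    cv::Mat bgr = cv::imread(path, 1); // 8-bit BGR image
    return ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR,
                                         bgr.cols, bgr.rows, 227, 227);
}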

blob.h

It records the relationships between layers, using a producer/consumer model. It is used in the blobs member of Net.

class Blob
{
public:
    // empty
    Blob();

public:
#if NCNN_STRING
    // blob name
    std::string name;
#endif // NCNN_STRING
    // layer index which produce this blob as output
    int producer;
    // layer index which need this blob as input
    std::vector<int> consumers;
};

net.h implements the framework of the whole network model. The important parts are:

  • how Net loads the model structure and the weights (a usage sketch follows this list)
  • how the relationships between layers are represented
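
For orientation, the typical inference flow through Net looks like this (a sketch following the examples; the blob names "data" and "prob" depend on the model):

#include "net.h"

// sketch of the usual usage, modelled on the examples
int run_inference(const ncnn::Mat& in, ncnn::Mat& out)
{
    ncnn::Net net;
    net.load_param("squeezenet.param"); // network structure
    net.load_model("squeezenet.bin");   // weights

    ncnn::Extractor ex = net.create_extractor();
    ex.input("data", in);               // feed the prepared input blob
    return ex.extract("prob", out);     // runs only the layers needed to produce "prob"
}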

Neural network structure definition

Network structure

class Net
{
protected:
    std::vector<Blob> blobs;
    std::vector<Layer*> layers;
    std::vector<layer_registry_entry> custom_layer_registry;
};

Registration functions

// layer factory function
typedef Layer* (*layer_creator_func)();

struct layer_registry_entry
{
    // layer type name
    const char* name;
    // layer factory entry
    layer_creator_func creator;
};

// get layer type from type name
int layer_to_index(const char* type);
// create layer from layer type
Layer* create_layer(int index);

#define DEFINE_LAYER_CREATOR(name) \
    Layer* name##_layer_creator() { return new name; }

Adding a custom layer

int Net::register_custom_layer(const char* type, layer_creator_func creator)
{
    int typeindex = layer_to_index(type);
    if (typeindex != 0)
    {
        fprintf(stderr, "can not register build-in layer type %s\n", type);
        return -1;
    }

    int custom_index = custom_layer_to_index(type);
    if (custom_index == -1)
    {
        struct layer_registry_entry entry = { type, creator };
        custom_layer_registry.push_back(entry);
    }
    else
    {
        fprintf(stderr, "overwrite existing custom layer type %s\n", type);
        custom_layer_registry[custom_index].name = type;
        custom_layer_registry[custom_index].creator = creator;
    }

    return 0;
}
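
Putting the pieces together, registering a custom layer looks roughly like this (a hypothetical MyRelu layer for illustration; not from the ncnn source):

// hypothetical custom layer
class MyRelu : public ncnn::Layer
{
public:
    virtual int forward(const ncnn::Mat& bottom_blob, ncnn::Mat& top_blob) const
    {
        top_blob = bottom_blob;        // identity pass-through, for illustration only
        return 0;
    }
};

// creator with the layer_creator_func signature expected by the registry
ncnn::Layer* MyRelu_layer_creator() { return new MyRelu; }

// during setup, before loading a param file that references the "MyRelu" type:
//     net.register_custom_layer("MyRelu", MyRelu_layer_creator);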

The convolution process

The convolution is computed directly, as you would expect. So how is this process accelerated? Look under src/layer: convolution.h/convolution.cpp is the plain convolution without vectorized acceleration, parallelized only with the OpenMP framework; convolution_arm.cpp is the ARM implementation and convolution_x86.cpp the x86 implementation. How do you know whether the accelerated or the plain version gets used? That is determined by how ncnn_add_layer is defined in /src/CMakeLists.txt.

macro(ncnn_add_layer class)
    string(TOLOWER ${class} name)

    # WITH_LAYER_xxx option
    if(${ARGC} EQUAL 2)
        option(WITH_LAYER_${name} "build with layer ${name}" ${ARGV1})
    else()
        option(WITH_LAYER_${name} "build with layer ${name}" ON)
    endif()

    message("WITH_LAYER_${name} = ${WITH_LAYER_${name}}")

    if(WITH_LAYER_${name})
        # append the default cpp implementation to the source list (nothing special here)
        list(APPEND ncnn_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/layer/${name}.cpp")

        # look for arch specific implementation and append source
        # optimized implementation for armv7 aarch64
        # check whether an architecture-specific implementation exists; this is the important part
        if((ANDROID AND ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "armv7-a"))
            OR (ANDROID AND ("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "aarch64"))
            OR (IOS AND ("${CMAKE_OSX_ARCHITECTURES}" STREQUAL "armv7"))
            OR (IOS AND ("${CMAKE_OSX_ARCHITECTURES}" STREQUAL "arm64"))
            OR (IOS AND ("${CMAKE_OSX_ARCHITECTURES}" STREQUAL "armv7;arm64")))
            if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/layer/arm/${name}_arm.cpp")
                # append the arch-specific source to the build list
                list(APPEND ncnn_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/layer/arm/${name}_arm.cpp")
                # set the variable WITH_LAYER_xxx_arm to 1
                set(WITH_LAYER_${name}_arm 1)
            endif()
        else()
            if(EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/layer/x86/${name}_x86.cpp")
                list(APPEND ncnn_SRCS "${CMAKE_CURRENT_SOURCE_DIR}/layer/x86/${name}_x86.cpp")
                set(WITH_LAYER_${name}_x86 1)
            endif()
        endif()
    endif()

    # generate layer_declaration and layer_registry file
    if(WITH_LAYER_${name})
        # the layer is enabled, so pick which creator to register
        if(WITH_LAYER_${name}_arm)
            # register the corresponding layer creator in the two generated files below (the registration
            # mechanism was described under **Neural network structure definition**); the arch-specific creator is the one registered, which is the key point
            file(APPEND ${CMAKE_CURRENT_BINARY_DIR}/layer_declaration.h
                "extern Layer* ${class}_arm_layer_creator();\n")
            file(APPEND ${CMAKE_CURRENT_BINARY_DIR}/layer_registry.h
                "#if NCNN_STRING\n{\"${class}\",${class}_arm_layer_creator},\n#else\n{${class}_arm_layer_creator},\n#endif\n")
        elseif(WITH_LAYER_${name}_x86)
            file(APPEND ${CMAKE_CURRENT_BINARY_DIR}/layer_declaration.h
                "extern Layer* ${class}_x86_layer_creator();\n")
            file(APPEND ${CMAKE_CURRENT_BINARY_DIR}/layer_registry.h
                "#if NCNN_STRING\n{\"${class}\",${class}_x86_layer_creator},\n#else\n{${class}_x86_layer_creator},\n#endif\n")
        else()
            file(APPEND ${CMAKE_CURRENT_BINARY_DIR}/layer_declaration.h
                "extern Layer* ${class}_layer_creator();\n")
            file(APPEND ${CMAKE_CURRENT_BINARY_DIR}/layer_registry.h
                "#if NCNN_STRING\n{\"${class}\",${class}_layer_creator},\n#else\n{${class}_layer_creator},\n#endif\n")
        endif()
    else()
        file(APPEND ${CMAKE_CURRENT_BINARY_DIR}/layer_registry.h "#if NCNN_STRING\n{\"${class}\",0},\n#else\n{0},\n#endif\n")
    endif()
endmacro()

In other words, the accelerated version of each layer is assembled into the build at CMake time; then, through the layer registration mechanism introduced earlier, different architectures end up using different implementations.
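
Concretely, for the Convolution layer on an ARM build, the two generated headers end up containing lines like these (reconstructed from the file(APPEND ...) calls above):

// layer_declaration.h (generated)
extern Layer* Convolution_arm_layer_creator();

// layer_registry.h (generated)
#if NCNN_STRING
{"Convolution",Convolution_arm_layer_creator},
#else
{Convolution_arm_layer_creator},
#endif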

Convolution

int Convolution::forward(const Mat& bottom_blob, Mat& top_blob) const
{
    // convolv with NxN kernel
    // value = value + bias

    int w = bottom_blob.w;
    int h = bottom_blob.h;
    int channels = bottom_blob.c;

    // padding handling (kernel_extent is the kernel's span after dilation)
    const int kernel_extent = dilation * (kernel_size - 1) + 1;

    Mat bottom_blob_bordered = bottom_blob;
    if (pad > 0)
    {
        copy_make_border(bottom_blob, bottom_blob_bordered, pad, pad, pad, pad, BORDER_CONSTANT, 0.f);
        if (bottom_blob_bordered.empty())
            return -100;

        w = bottom_blob_bordered.w;
        h = bottom_blob_bordered.h;
    }
    else if (pad == -233)
    {
        int wpad = kernel_extent + (w - 1) / stride * stride - w;
        int hpad = kernel_extent + (h - 1) / stride * stride - h;

        copy_make_border(bottom_blob, bottom_blob_bordered, hpad / 2, hpad - hpad / 2, wpad / 2, wpad - wpad / 2, BORDER_CONSTANT, 0.f);
        if (bottom_blob_bordered.empty())
            return -100;

        w = bottom_blob_bordered.w;
        h = bottom_blob_bordered.h;
    }
    
    // compute the output size
    int outw = (w - kernel_extent) / stride + 1;
    int outh = (h - kernel_extent) / stride + 1;

    top_blob.create(outw, outh, num_output);
    if (top_blob.empty())
        return -100;

    const int maxk = kernel_size * kernel_size;

    // kernel offsets
    // precompute the kernel taps' relative offsets within the (padded) image, so the inner loop can read them directly
    std::vector<int> _space_ofs(maxk);
    int* space_ofs = &_space_ofs[0];
    {
        int p1 = 0;
        int p2 = 0;
        int gap = w * dilation - kernel_size * dilation;
        for (int i = 0; i < kernel_size; i++)
        {
            for (int j = 0; j < kernel_size; j++)
            {
                space_ofs[p1] = p2;
                p1++;
                p2 += dilation;
            }
            p2 += gap;
        }
    }

    // num_output
    const float* weight_data_ptr = weight_data;
    #pragma omp parallel for
    for (int p=0; p<num_output; p++)
    {
        float* outptr = top_blob.channel(p);

        for (int i = 0; i < outh; i++)
        {
            for (int j = 0; j < outw; j++)
            {
                float sum = 0.f;

                if (bias_term)
                    sum = bias_data.data[p];
                // weights of the kernel for this output channel
                const float* kptr = weight_data_ptr + maxk * channels * p;

                // channels
                for (int q=0; q<channels; q++)
                {
                    // the q-th channel of the (padded) input image
                    const Mat m = bottom_blob_bordered.channel(q);
                    const float* sptr = m.data + m.w * i*stride + j*stride;

                    for (int k = 0; k < maxk; k++) // 29.23
                    {
                        float val = sptr[ space_ofs[k] ]; // 20.72
                        float w = kptr[k];
                        sum += val * w; // 41.45
                    }

                    kptr += maxk;
                }

                outptr[j] = sum;
            }

            outptr += outw;
        }
    }

    return 0;
}

As you can see, Convolution is computed directly rather than with Caffe's im2col approach. Next, let's see how the x86 version accelerates it.

The x86 path only accelerates 3x3 and 5x5 convolutions with stride 1, driven by Convolution_x86::forward. There, a conv_func_table array holds function pointers to the concrete convolution routines. Let's look at how the 3x3 case is optimized:

static void conv3x3s1_sse(const Mat& bottom_blob, Mat& top_blob, const Mat& _kernel, const Mat& _bias)
{
    // input size
    int w = bottom_blob.w;
    int inch = bottom_blob.c;
    // output size
    int outw = top_blob.w;
    int outh = top_blob.h;
    int outch = top_blob.c;

    const float* kernel = _kernel;
    const float* bias = _bias;
    // parallelize over output channels with OpenMP
    #pragma omp parallel for
    for (int p=0; p<outch; p++)
    {
        Mat out = top_blob.channel(p);

        const float bias0 = bias ? bias[p] : 0.f;

        out.fill(bias0);

        for (int q=0; q<inch; q++)
        {
            float* outptr = out;
            float* outptr2 = outptr + outw;

            const float* img0 = bottom_blob.channel(q);

            const float* kernel0 = kernel + p*inch*9  + q*9;

            const float* r0 = img0;
            const float* r1 = img0 + w;
            const float* r2 = img0 + w*2;
            const float* r3 = img0 + w*3;

            const float* k0 = kernel0;
            const float* k1 = kernel0 + 3;
            const float* k2 = kernel0 + 6;

            int i = 0;
            // main optimized path: the row loop is unrolled so each iteration computes two
            // output rows at once, relying on the compiler for further acceleration
            for (; i+1 < outh; i+=2)
            {

                int remain = outw;

                for (; remain>0; remain--)
                {
                    float sum = 0;
                    float sum2 = 0;

                    sum += r0[0] * k0[0];
                    sum += r0[1] * k0[1];
                    sum += r0[2] * k0[2];
                    sum += r1[0] * k1[0];
                    sum += r1[1] * k1[1];
                    sum += r1[2] * k1[2];
                    sum += r2[0] * k2[0];
                    sum += r2[1] * k2[1];
                    sum += r2[2] * k2[2];

                    sum2 += r1[0] * k0[0];
                    sum2 += r1[1] * k0[1];
                    sum2 += r1[2] * k0[2];
                    sum2 += r2[0] * k1[0];
                    sum2 += r2[1] * k1[1];
                    sum2 += r2[2] * k1[2];
                    sum2 += r3[0] * k2[0];
                    sum2 += r3[1] * k2[1];
                    sum2 += r3[2] * k2[2];

                    *outptr += sum;
                    *outptr2 += sum2;

                    r0++;
                    r1++;
                    r2++;
                    r3++;
                    outptr++;
                    outptr2++;
                }

                r0 += 2 + w;
                r1 += 2 + w;
                r2 += 2 + w;
                r3 += 2 + w;

                outptr += outw;
                outptr2 += outw;
            }
            // leftover row: compute one output position at a time
            for (; i < outh; i++)
            {
                int remain = outw;

                for (; remain>0; remain--)
                {
                    float sum = 0;

                    sum += r0[0] * k0[0];
                    sum += r0[1] * k0[1];
                    sum += r0[2] * k0[2];
                    sum += r1[0] * k1[0];
                    sum += r1[1] * k1[1];
                    sum += r1[2] * k1[2];
                    sum += r2[0] * k2[0];
                    sum += r2[1] * k2[1];
                    sum += r2[2] * k2[2];

                    *outptr += sum;

                    r0++;
                    r1++;
                    r2++;
                    outptr++;
                }

                r0 += 2;
                r1 += 2;
                r2 += 2;
            }

        }
    }

}

Pooling

The Pooling implementation is very similar to Convolution, so I will not walk through it. It provides global max pooling, global average pooling, local max pooling, and local average pooling; the local variants handle border padding. For the padding details, see how void copy_make_border(...) is implemented.
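
For reference, the core of a local max-pooling pass looks roughly like this: a sketch in the same style as the convolution loop above, reusing the space_ofs idea, not the exact ncnn code.

// sketch: local max pooling over a kernel_size x kernel_size window
#pragma omp parallel for
for (int q = 0; q < channels; q++)
{
    const Mat m = bottom_blob_bordered.channel(q);
    float* outptr = top_blob.channel(q);

    for (int i = 0; i < outh; i++)
    {
        for (int j = 0; j < outw; j++)
        {
            const float* sptr = m.data + m.w * i * stride + j * stride;

            float max = sptr[0];
            for (int k = 0; k < maxk; k++)         // maxk = kernel_size * kernel_size
            {
                float val = sptr[ space_ofs[k] ];
                max = val > max ? val : max;
            }

            outptr[j] = max;
        }
        outptr += outw;
    }
}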

Softmax

Using omp parallel for, it performs the following steps:

value = exp(value - global max value)
sum = sum of all values
value = value / sum

On ARM this is further accelerated with SIMD, which I will not go into here.
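
In code, those three steps over one contiguous buffer of size floats boil down to the following (a plain scalar sketch, without the OpenMP/NEON details):

// sketch: numerically stable softmax over 'size' values
float max = ptr[0];
for (int i = 1; i < size; i++)
    max = ptr[i] > max ? ptr[i] : max;             // global max

float sum = 0.f;
for (int i = 0; i < size; i++)
{
    outptr[i] = exp(ptr[i] - max);                 // value = exp(value - max)
    sum += outptr[i];
}

for (int i = 0; i < size; i++)
    outptr[i] /= sum;                              // value = value / sum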

Sigmoid is optimized on ARM in a similar way to Softmax; it is much simpler than Softmax, so I will not cover it.

Batch Normalization

The BatchNorm layer is a bit different: the forward pass alone does not tell the whole story; you also have to look at what happens in load_model.

a = bias - slope * mean / sqrt(var)
b = slope / sqrt(var)
value = b * value + a

In most explanations slope is simply taken to be 1. Since the layer computes value = slope * (value - mean) / sqrt(var) + bias, grouping terms gives exactly value = b * value + a with the a and b above, so both can be precomputed once per channel in load_model.

int BatchNorm::load_model(const unsigned char*& mem)
{
    slope_data = Mat(channels, (float*)mem);
    mem += channels * sizeof(float);

    mean_data = Mat(channels, (float*)mem);
    mem += channels * sizeof(float);

    var_data = Mat(channels, (float*)mem);
    mem += channels * sizeof(float);

    bias_data = Mat(channels, (float*)mem);
    mem += channels * sizeof(float);

    a_data.create(channels);
    if (a_data.empty())
        return -100;
    b_data.create(channels);
    if (b_data.empty())
        return -100;
    const float* slope_data_ptr = slope_data;
    const float* mean_data_ptr = mean_data;
    const float* var_data_ptr = var_data;
    const float* bias_data_ptr = bias_data;
    float* a_data_ptr = a_data;
    float* b_data_ptr = b_data;
    for (int i=0; i<channels; i++)
    {
        float sqrt_var = sqrt(var_data_ptr[i]);
        a_data_ptr[i] = bias_data_ptr[i] - slope_data_ptr[i] * mean_data_ptr[i] / sqrt_var;
        b_data_ptr[i] = slope_data_ptr[i] / sqrt_var;
    }

    return 0;
}

int BatchNorm::forward(const Mat& bottom_blob, Mat& top_blob) const
{
    // a = bias - slope * mean / sqrt(var)
    // b = slope / sqrt(var)
    // value = b * value + a

    int w = bottom_blob.w;
    int h = bottom_blob.h;
    int size = w * h;

    top_blob.create(w, h, channels);
    if (top_blob.empty())
        return -100;

    const float* a_data_ptr = a_data;
    const float* b_data_ptr = b_data;
    #pragma omp parallel for
    for (int q=0; q<channels; q++)
    {
        const float* ptr = bottom_blob.channel(q);
        float* outptr = top_blob.channel(q);

        float a = a_data_ptr[q];
        float b = b_data_ptr[q];

        for (int i=0; i<size; i++)
        {
            outptr[i] = b * ptr[i] + a;
        }
    }

    return 0;
}

If Batch Normalization is unfamiliar, see references [3] and [4] below.

Takeaways

A few useful memory-alignment helpers. I have seen these in OpenCV before; listing them here again underlines their importance. For why they are implemented this way, look up material on OpenCV memory management.

// the alignment of all the allocated buffers
#define MALLOC_ALIGN    16

// Aligns a pointer to the specified number of bytes
// ptr Aligned pointer
// n Alignment size that must be a power of two
template<typename _Tp> static inline _Tp* alignPtr(_Tp* ptr, int n=(int)sizeof(_Tp))
{
    return (_Tp*)(((size_t)ptr + n-1) & -n);
}

// Aligns a buffer size to the specified number of bytes
// The function returns the minimum number that is greater or equal to sz and is divisible by n
// sz Buffer size to align
// n Alignment size that must be a power of two
static inline size_t alignSize(size_t sz, int n)
{
    return (sz + n-1) & -n;
}
// it turns out negative indexing is not just a Python thing; C can do it too (see adata[-1] below)
static inline void* fastMalloc(size_t size)
{
    unsigned char* udata = (unsigned char*)malloc(size + sizeof(void*) + MALLOC_ALIGN);
    if (!udata)
        return 0;
    unsigned char** adata = alignPtr((unsigned char**)udata + 1, MALLOC_ALIGN);
    adata[-1] = udata;
    return adata;
}

static inline void fastFree(void* ptr)
{
    if (ptr)
    {
        unsigned char* udata = ((unsigned char**)ptr)[-1];
        free(udata);
    }
}
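
A small usage sketch to make the trick concrete: the pointer-sized slot just before the returned address stores the original malloc pointer, which is exactly what fastFree reads back through the negative index.

#include <cassert>
#include <cstddef>

int main()
{
    void* p = fastMalloc(1000);                        // aligned allocation
    assert(((size_t)p & (MALLOC_ALIGN - 1)) == 0);     // returned pointer is 16-byte aligned
    // ((unsigned char**)p)[-1] holds the pointer originally returned by malloc
    fastFree(p);                                       // frees via that stored pointer
    return 0;
}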

Summary

What remains is to go read Intel's intrinsics documentation.
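
As a small taste of what those docs cover, here is a minimal, illustrative SSE snippet that accumulates four multiply-adds per iteration, the kind of building block one would reach for when truly vectorizing the inner loops above (not ncnn code):

#include <xmmintrin.h> // SSE

// sums a[i] * b[i] over n floats, four at a time (n assumed to be a multiple of 4)
static float dot_sse(const float* a, const float* b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}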

References

  1. android-cmake getting started guide
  2. SMP Symmetric Multi-Processor
  3. Batch Normalization study notes
  4. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
  5. Intel Intrinsics Guide