Implementing and optimizing nearest-neighbor resize

0. Purpose

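This post records how nearest-neighbor interpolation (resize) was optimized under the following constraints:

  • Android ARMv8 platform, Xiaomi 11 (Qualcomm Snapdragon 888)
  • Single thread
  • Pinned to a little core
  • Source image: w=1920, h=1080, c=3
  • Target image: w=768, h=448, c=3
  • Output and speed compared against OpenCV 4.5.5 for Android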

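Result: the little-core time dropped from about 19.3 ms to about 2.9 ms. For comparison:

  • cv::resize( INTER_NEAREST ) takes about 4.1 ms
  • Without any hand-written SIMD, the generic optimizations (up to the 4th optimization) roughly match OpenCV
  • The specialized optimizations (channel count and image sizes known in advance) improve performance further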

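1. Factors that affect performance

In order of priority:

  1. Algorithm / strategy (not considered in this post).
  2. Big vs. little cores. A little core can take up to 10 times as long as a big core.
  3. Index computation when accessing pixels.
    • Ordinary pixel-coordinate access
    • The src/dst coordinate mapping in the interpolation algorithm
  4. Parameters known in advance, which lets the compiler optimize automatically.
    • channels, height, width, scale factors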
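Because the big/little core gap dominates the measurements, the benchmark thread has to stay on one known core. The post does not show how the pinning was done; the following is a minimal sketch for Linux/Android using sched_setaffinity. The helper names pin_current_thread_to_cpu and first_allowed_cpu are hypothetical, and which CPU indices are little cores is device-specific, so it has to be checked on the target device.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // needed for the CPU_* macros on some toolchains
#endif
#include <cassert>
#include <sched.h>           // sched_setaffinity, sched_getaffinity, cpu_set_t

// Pin the calling thread to a single CPU. Returns true on success.
// cpu_index is an assumption of the caller: the mapping from index to
// big/little core differs between SoCs.
bool pin_current_thread_to_cpu(int cpu_index)
{
    assert(cpu_index >= 0 && cpu_index < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_index, &mask);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}

// Return the lowest CPU index the calling thread is allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) {
            return cpu;
        }
    }
    return -1;
}
```

Calling pin_current_thread_to_cpu once at the start of the benchmark thread, before any timing, keeps every measurement on the same core.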

The complete timing results are listed here for reference; the following sections give the code for each optimization:

Little core:

method                 time (ms)
cv::resize             4.116511
plain::resize_nn       19.32479
better::resize_nn_v1   16.67385
better::resize_nn_v2   8.003958
better::resize_nn_v3   5.574219
better::resize_nn_v4   4.313802
better::resize_nn_v5   3.067396
better::resize_nn_v6   2.856667

Big core:

method                 time (ms)
cv::resize             0.975156
plain::resize_nn       2.391042
better::resize_nn_v1   2.247709
better::resize_nn_v2   1.565156
better::resize_nn_v3   1.208750
better::resize_nn_v4   1.189636
better::resize_nn_v5   0.862031
better::resize_nn_v6   0.761927

2. Baseline: cv::resize (about 4 ms)

std::string filename = load_prefix + "/1920x1080.jpg";

cv::Mat src = cv::imread(filename);

cv::Size dsz;
dsz.height = 448;
dsz.width = 768;

cv::Mat result_cv;
{
    AutoTimer timer("cv::resize");
    cv::resize(src, result_cv, dsz, 0, 0, cv::INTER_NEAREST);
}

Time:
cv::resize: took 4.116511 ms
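The AutoTimer used above is not defined in the post. A minimal sketch of what such a scope timer might look like, assuming it prints the elapsed wall-clock time when it goes out of scope (the class body and the elapsed_ms method are assumptions, only the name and the log format come from the post):

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical stand-in for AutoTimer: measures the lifetime of the
// enclosing scope and prints it in the "name: took X ms" format.
class AutoTimer
{
public:
    explicit AutoTimer(std::string name)
        : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}

    ~AutoTimer()
    {
        std::printf("%s: took %f ms\n", name_.c_str(), elapsed_ms());
    }

    // Milliseconds elapsed since construction.
    double elapsed_ms() const
    {
        using ms = std::chrono::duration<double, std::milli>;
        return ms(std::chrono::steady_clock::now() - start_).count();
    }

    AutoTimer(const AutoTimer&) = delete;
    AutoTimer& operator=(const AutoTimer&) = delete;

private:
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

steady_clock is used because it is monotonic, so the reading cannot go backwards if the system clock is adjusted during a run.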

3. Implementation and optimization

Naive implementation: 19 ms

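#include <utility>              // std::swap
#include <vector>               // std::vector (index tables in the later versions)
#include <opencv2/opencv.hpp>   // cv::Mat, cv::Size, CV_Error

// Clamp val into [minval, maxval]; the bounds are swapped if passed in the wrong order.
template<typename T>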
static
T clip(T val, T minval, T maxval)
{
    if (minval > maxval) {
        std::swap(minval, maxval);
    }

    T result = val;
    if (val < minval) {
        result = minval;
    }
    if (val > maxval) {
        result = maxval;
    }
    return result;
}

void resize_nn_v0(const cv::Mat& src, cv::Mat& dst, cv::Size dsize)
{
    int depth = CV_MAT_DEPTH(src.type());
    if (depth != CV_8U) {
        CV_Error(cv::Error::StsBadArg, "only uchar type is supported");
    }

    cv::Size ssize = src.size();
    int channels = src.channels();

    dst.create(dsize, src.type());

    const int src_height = ssize.height;
    const int src_width = ssize.width;
    const int dst_height = dsize.height;
    const int dst_width = dsize.width;

    const float scale_w = src_width * 1.0 / dst_width;
    const float scale_h = src_height * 1.0 / dst_height;

    for (int i = 0; i < dst_height; i++)
    {
        int src_i = static_cast<int>(i * scale_h);
        src_i = clip(src_i, 0, src_height-1);

        for (int j = 0; j < dst_width; j++)
        {
            int src_j = static_cast<int>(j * scale_w); 
            src_j = clip(src_j, 0, src_width-1);

            for (int k = 0; k < channels; k++)
            {
                // .ptr(int, int) returns uchar* type
                dst.ptr(i, j)[k] = src.ptr(src_i, src_j)[k];
            }
        }
    }
}

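Even this naive implementation already avoids several pitfalls:

  • It uses cv::Mat rather than a hand-rolled data structure (a naive implementation has no need for its own container and should stay general)
  • The triple loop iterates over the dst image rather than the src image, so the mapped index is obtained with a multiplication instead of a division (division takes more cycles than multiplication)
  • The scale factors scale_w and scale_h are computed up front instead of being recomputed inside the for loops
  • To stay consistent with OpenCV's output, the computed indices are clamped (clip)
  • The mapped index src_i is computed in the outer loop rather than the inner loop, which avoids some repeated work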
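To make the dst-to-src mapping concrete, here is a small stand-alone sketch of the same arithmetic without OpenCV. The function name map_index is hypothetical; the scale expression and the truncate-then-clamp order follow resize_nn_v0.

```cpp
#include <algorithm>
#include <cassert>

// Map a destination index to its nearest-neighbor source index:
// multiply by src_len / dst_len, truncate toward zero, then clamp
// to the last valid source index.
int map_index(int dst_index, int src_len, int dst_len)
{
    assert(src_len > 0 && dst_len > 0);
    const float scale = src_len * 1.0 / dst_len;   // same expression as scale_w / scale_h
    const int src_index = static_cast<int>(dst_index * scale);
    return std::min(std::max(src_index, 0), src_len - 1);
}
```

For the sizes used in this post, scale_w = 1920 / 768 = 2.5 and scale_h = 1080 / 448 ≈ 2.4107. Destination column 1 maps to 2.5, truncated to source column 2; the last column 767 maps to 1917.5, truncated to 1917; the last row 447 maps to about 1077.59, truncated to 1077. All of these are inside the source image, so for this size pair the clamp never changes a value.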

Time:
plain::resize_nn: took 19.324792 ms

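Optimization 1: precompute the indices. Saves about 2.6 ms

Previously the mapped index was both computed and used O(dh * dw) times. Now it is computed O(dw + dh) times and used O(dw * dh) times, which removes the redundant multiplications, conversions and clamps.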
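For the sizes in this post the counts can be worked out directly. The sketch below is only arithmetic (the constant and function names are hypothetical): in the naive version src_i is computed once per row and src_j once per destination pixel, while the table version computes each index once.

```cpp
#include <cassert>

constexpr int kDstHeight = 448;
constexpr int kDstWidth  = 768;

// Naive version: src_i once per row, src_j once per destination pixel.
constexpr long index_computations_naive()
{
    return kDstHeight + static_cast<long>(kDstHeight) * kDstWidth;
}

// Table version: one entry per destination row plus one per destination column.
constexpr long index_computations_table()
{
    return kDstHeight + kDstWidth;
}

static_assert(index_computations_naive() == 344512, "448 + 448 * 768");
static_assert(index_computations_table() == 1216, "448 + 768");
```

That is roughly 280 times fewer index computations, yet the measured gain is modest (19.3 ms to 16.7 ms), which suggests that most of the time is spent elsewhere in the loop.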

// Precompute the index tables. Saves about 2.6 ms
void resize_nn_v1(const cv::Mat& src, cv::Mat& dst, cv::Size dsize)
{
    int depth = CV_MAT_DEPTH(src.type());
    if (depth != CV_8U) {
        CV_Error(cv::Error::StsBadArg, "only uchar type is supported");
    }

    cv::Size ssize = src.size();
    int channels = src.channels();

    dst.create(dsize, src.type());

    const int src_height = ssize.height;
    const int src_width = ssize.width;
    const int dst_height = dsize.height;
    const int dst_width = dsize.width;

    const float scale_w = src_width * 1.0 / dst_width;
    const float scale_h = src_height * 1.0 / dst_height;

    std::vector<int> src_h_table(dst_height);
    std::vector<int> src_w_table(dst_width);

    for (int i = 0; i < dst_height; i++)
    {
        src_h_table[i] = static_cast<int>(i * scale_h);
        src_h_table[i] = clip(src_h_table[i], 0, src_height - 1);
    }
    for (int j = 0; j < dst_width; j++)
    {
        src_w_table[j] = static_cast<int>(j * scale_w);
        src_w_table[j] = clip(src_w_table[j], 0, src_width - 1);
    }

    for (int i = 0; i < dst_height; i++)
    {
        //int src_i = static_cast<int>(i * scale_h);
        //src_i = clip(src_i, 0, src_height - 1);
        int src_i = src_h_table[i];

        for (int j = 0; j < dst_width; j++)
        {
            //int src_j = static_cast<int>(j * scale_w);
            //src_j = clip(src_j, 0, src_width - 1);
            int src_j = src_w_table[j];

            for (int k = 0; k < channels; k++)
            {
                // .ptr(int, int) returns uchar* type
                dst.ptr(i, j)[k] = src.ptr(src_i, src_j)[k];
            }
        }
    }
}

Time:
plain::resize_nn: took 19.324792 ms
better::resize_nn_v1: took 16.673854 ms

Optimization 2: hoist the Mat.ptr calls out of the innermost loop. Saves about 8.7 ms

void resize_nn_v2(const cv::Mat& src, cv::Mat& dst, cv::Size dsize)
{
    int depth = CV_MAT_DEPTH(src.type());
    if (depth != CV_8U) {
        CV_Error(cv::Error::StsBadArg, "only uchar type is supported");
    }

    cv::Size ssize = src.size();
    int channels = src.channels();

    dst.create(dsize, src.type());

    const int src_height = ssize.height;
    const int src_width = ssize.width;
    const int dst_height = dsize.height;
    const int dst_width = dsize.width;

    const float scale_w = src_width * 1.0 / dst_width;
    const float scale_h = src_height * 1.0 / dst_height;

    std::vector<int> src_h_table(dst_height);
    std::vector<int> src_w_table(dst_width);

    for (int i = 0; i < dst_height; i++)
    {
        src_h_table[i] = static_cast<int>(i * scale_h);
        src_h_table[i] = clip(src_h_table[i], 0, src_height - 1);
    }
    for (int j = 0; j < dst_width; j++)
    {
        src_w_table[j] = static_cast<int>(j * scale_w);
        src_w_table[j] = clip(src_w_table[j], 0, src_width - 1);
    }

    for (int i = 0; i < dst_height; i++)
    {
        int src_i = src_h_table[i];

        for (int j = 0; j < dst_width; j++)
        {
            int src_j = src_w_table[j];

            uchar* dst_pixel = dst.ptr(i, j);
            const uchar* src_pixel = src.ptr(src_i, src_j);
            for (int k = 0; k < channels; k++)
            {
                // .ptr(int, int) returns uchar* type
                //dst.ptr(i, j)[k] = src.ptr(src_i, src_j)[k];
                dst_pixel[k] = src_pixel[k];
            }
        }
    }
}

Time:
better::resize_nn_v1: took 16.673854 ms
better::resize_nn_v2: took 8.003958 ms
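One way to see why this helps: a two-index ptr(i, j) lookup has to redo the row and column address arithmetic on every call. In v0 and v1 that happens twice per channel (once for dst, once for src), so six times per 3-channel pixel; after hoisting it happens twice per pixel. The struct below is a hypothetical minimal image type used only to illustrate that arithmetic. It is not cv::Mat, though cv::Mat::ptr performs a comparable data + i * step + j * element-size computation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal 8-bit interleaved image, for illustration only.
struct Image
{
    int height;
    int width;
    int channels;
    std::size_t step;                   // bytes per row
    std::vector<unsigned char> data;

    Image(int h, int w, int c)
        : height(h), width(w), channels(c),
          step(static_cast<std::size_t>(w) * c), data(step * h) {}

    // Per-pixel lookup: a row multiply and a column multiply on every call.
    unsigned char* ptr(int i, int j)
    {
        return data.data() + i * step + static_cast<std::size_t>(j) * channels;
    }

    // Per-row lookup: only the row multiply; the caller adds the column offset.
    unsigned char* ptr(int i)
    {
        return data.data() + i * step;
    }
};
```

A compiler will not necessarily hoist such calls by itself, for example when it cannot prove that the byte stores into dst leave the image headers untouched. Doing the hoisting by hand removes that dependence on the optimizer.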

Optimization 3: replace Mat.ptr in the second-innermost loop with row pointers taken one level further out. Saves about 2.4 ms

void resize_nn_v3(const cv::Mat& src, cv::Mat& dst, cv::Size dsize)
{
    int depth = CV_MAT_DEPTH(src.type());
    if (depth != CV_8U) {
        CV_Error(cv::Error::StsBadArg, "only uchar type is supported");
    }

    cv::Size ssize = src.size();
    int channels = src.channels();

    dst.create(dsize, src.type());

    const int src_height = ssize.height;
    const int src_width = ssize.width;
    const int dst_height = dsize.height;
    const int dst_width = dsize.width;

    const float scale_w = src_width * 1.0 / dst_width;
    const float scale_h = src_height * 1.0 / dst_height;

    std::vector<int> src_h_table(dst_height);
    std::vector<int> src_w_table(dst_width);

    for (int i = 0; i < dst_height; i++)
    {
        src_h_table[i] = static_cast<int>(i * scale_h);
        src_h_table[i] = clip(src_h_table[i], 0, src_height - 1);
    }
    for (int j = 0; j < dst_width; j++)
    {
        src_w_table[j] = static_cast<int>(j * scale_w);
        src_w_table[j] = clip(src_w_table[j], 0, src_width - 1);
    }

    for (int i = 0; i < dst_height; i++)
    {
        int src_i = src_h_table[i];
        uchar* dst_line = dst.ptr(i);
        const uchar* src_line = src.ptr(src_i);
        for (int j = 0; j < dst_width; j++)
        {
            int src_j = src_w_table[j];

            //uchar* dst_pixel = dst.ptr(i, j);
            uchar* dst_pixel = dst_line + j * channels;
            //const uchar* src_pixel = src.ptr(src_i, src_j);
            const uchar* src_pixel = src_line + src_j * channels;
            for (int k = 0; k < channels; k++)
            {
                dst_pixel[k] = src_pixel[k];
            }
        }
    }
}

Time:
better::resize_nn_v2: took 8.003958 ms
better::resize_nn_v3: took 5.574219 ms

Optimization 4: store src_j * channels in the table instead of src_j. Saves about 1.3 ms

void resize_nn_v4(const cv::Mat& src, cv::Mat& dst, cv::Size dsize)
{
    int depth = CV_MAT_DEPTH(src.type());
    if (depth != CV_8U) {
        CV_Error(cv::Error::StsBadArg, "only uchar type is supported");
    }

    cv::Size ssize = src.size();
    int channels = src.channels();

    dst.create(dsize, src.type());

    const int src_height = ssize.height;
    const int src_width = ssize.width;
    const int dst_height = dsize.height;
    const int dst_width = dsize.width;

    const float scale_w = src_width * 1.0 / dst_width;
    const float scale_h = src_height * 1.0 / dst_height;

    std::vector<int> src_h_table(dst_height);
    std::vector<int> src_w_table(dst_width);

    for (int i = 0; i < dst_height; i++)
    {
        src_h_table[i] = static_cast<int>(i * scale_h);
        src_h_table[i] = clip(src_h_table[i], 0, src_height - 1);
    }
    for (int j = 0; j < dst_width; j++)
    {
        src_w_table[j] = static_cast<int>(j * scale_w);
        src_w_table[j] = clip(src_w_table[j], 0, src_width - 1) * channels;
    }

    for (int i = 0; i < dst_height; i++)
    {
        int src_i = src_h_table[i];
        uchar* dst_line = dst.ptr(i);
        const uchar* src_line = src.ptr(src_i);
        for (int j = 0; j < dst_width; j++)
        {
            //uchar* dst_pixel = dst.ptr(i, j);
            uchar* dst_pixel = dst_line + j * channels;
            //const uchar* src_pixel = src.ptr(src_i, src_j);
            const uchar* src_pixel = src_line + src_w_table[j];
            for (int k = 0; k < channels; k++)
            {
                dst_pixel[k] = src_pixel[k];
            }
        }
    }
}

Time:
better::resize_nn_v3: took 5.574219 ms
better::resize_nn_v4: took 4.313802 ms
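The same idea can be tried without OpenCV. The following sketch applies the byte-offset table to tightly packed buffers; the function name resize_nn_packed and the assumption that rows carry no padding are mine, the loop structure follows resize_nn_v4.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Nearest-neighbor resize on tightly packed 8-bit interleaved buffers
// (no row padding: an assumption of this sketch). The column table stores
// byte offsets (src_j * channels) instead of pixel indices, so the inner
// loop needs no multiply on the source side.
std::vector<unsigned char> resize_nn_packed(const std::vector<unsigned char>& src,
                                            int src_h, int src_w, int channels,
                                            int dst_h, int dst_w)
{
    assert(src_h > 0 && src_w > 0 && channels > 0 && dst_h > 0 && dst_w > 0);
    assert(src.size() == static_cast<std::size_t>(src_h) * src_w * channels);

    const float scale_w = src_w * 1.0 / dst_w;
    const float scale_h = src_h * 1.0 / dst_h;

    std::vector<int> src_row(dst_h);
    std::vector<int> src_col_offset(dst_w);
    for (int i = 0; i < dst_h; i++) {
        src_row[i] = std::min(static_cast<int>(i * scale_h), src_h - 1);
    }
    for (int j = 0; j < dst_w; j++) {
        src_col_offset[j] = std::min(static_cast<int>(j * scale_w), src_w - 1) * channels;
    }

    const std::size_t src_step = static_cast<std::size_t>(src_w) * channels;
    const std::size_t dst_step = static_cast<std::size_t>(dst_w) * channels;
    std::vector<unsigned char> dst(dst_step * dst_h);

    for (int i = 0; i < dst_h; i++) {
        const unsigned char* src_line = src.data() + src_row[i] * src_step;
        unsigned char* dst_line = dst.data() + i * dst_step;
        for (int j = 0; j < dst_w; j++) {
            const unsigned char* src_pixel = src_line + src_col_offset[j];
            unsigned char* dst_pixel = dst_line + j * channels;
            for (int k = 0; k < channels; k++) {
                dst_pixel[k] = src_pixel[k];
            }
        }
    }
    return dst;
}
```

Shrinking a 4x4 single-channel image holding the values 0 to 15 down to 2x2 picks source rows 0 and 2 and source columns 0 and 2, which yields the pixels 0, 2, 8 and 10.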

Optimization 5: the channel count is known in advance to be 3. Saves about 1.2 ms

void resize_nn_v5(const cv::Mat& src, cv::Mat& dst, cv::Size dsize)
{
    int depth = CV_MAT_DEPTH(src.type());
    if (depth != CV_8U) {
        CV_Error(cv::Error::StsBadArg, "only uchar type is supported");
    }

    cv::Size ssize = src.size();
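    //int channels = src.channels();
    const int channels = 3; // compile-time constant, so the channel loop can be unrolled
    CV_Assert(src.channels() == channels); // guard against inputs with another channel count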

    dst.create(dsize, src.type());

    const int src_height = ssize.height;
    const int src_width = ssize.width;
    const int dst_height = dsize.height;
    const int dst_width = dsize.width;

    const float scale_w = src_width * 1.0 / dst_width;
    const float scale_h = src_height * 1.0 / dst_height;

    std::vector<int> src_h_table(dst_height);
    std::vector<int> src_w_table(dst_width);

    for (int i = 0; i < dst_height; i++)
    {
        src_h_table[i] = static_cast<int>(i * scale_h);
        src_h_table[i] = clip(src_h_table[i], 0, src_height - 1);
    }
    for (int j = 0; j < dst_width; j++)
    {
        src_w_table[j] = static_cast<int>(j * scale_w);
        src_w_table[j] = clip(src_w_table[j], 0, src_width - 1) * channels;
    }

    for (int i = 0; i < dst_height; i++)
    {
        int src_i = src_h_table[i];
        uchar* dst_line = dst.ptr(i);
        const uchar* src_line = src.ptr(src_i);
        for (int j = 0; j < dst_width; j++)
        {
            //uchar* dst_pixel = dst.ptr(i, j);
            uchar* dst_pixel = dst_line + j * channels;
            //const uchar* src_pixel = src.ptr(src_i, src_j);
            const uchar* src_pixel = src_line + src_w_table[j];
            for (int k = 0; k < channels; k++)
            {
                dst_pixel[k] = src_pixel[k];
            }
        }
    }
}

Time:
better::resize_nn_v4: took 4.313802 ms
better::resize_nn_v5: took 3.067396 ms
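Hard-coding channels = 3 ties the function to one input format. A possible alternative, not measured in this post, is to make the channel count a template parameter and dispatch at run time, so that each instantiation still sees a constant trip count. The names copy_row_nn and copy_row_nn_dispatch below are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Copy one destination row with the channel count fixed at compile time,
// so the innermost loop has a constant trip count the compiler can unroll.
// src_col_offset holds byte offsets (src_j * Channels), one per destination column.
template <int Channels>
void copy_row_nn(const unsigned char* src_line, unsigned char* dst_line,
                 const std::vector<int>& src_col_offset)
{
    const int dst_w = static_cast<int>(src_col_offset.size());
    for (int j = 0; j < dst_w; j++) {
        const unsigned char* src_pixel = src_line + src_col_offset[j];
        unsigned char* dst_pixel = dst_line + j * Channels;
        for (int k = 0; k < Channels; k++) {
            dst_pixel[k] = src_pixel[k];
        }
    }
}

// Runtime dispatch to the compile-time instantiations. Returns false for
// channel counts this sketch does not cover.
bool copy_row_nn_dispatch(int channels,
                          const unsigned char* src_line, unsigned char* dst_line,
                          const std::vector<int>& src_col_offset)
{
    assert(src_line != nullptr && dst_line != nullptr);
    switch (channels) {
    case 1: copy_row_nn<1>(src_line, dst_line, src_col_offset); return true;
    case 3: copy_row_nn<3>(src_line, dst_line, src_col_offset); return true;
    case 4: copy_row_nn<4>(src_line, dst_line, src_col_offset); return true;
    default: return false;
    }
}
```

The switch runs once per row rather than once per pixel, so its cost should be negligible next to the copy itself. Whether this matches the hard-coded version's timing would need to be measured.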

Optimization 6: the source and target sizes are known in advance. Saves about 0.2 ms

void resize_nn_v6(const cv::Mat& src, cv::Mat& dst, cv::Size dsize)
{
    int depth = CV_MAT_DEPTH(src.type());
    if (depth != CV_8U) {
        CV_Error(cv::Error::StsBadArg, "only uchar type is supported");
    }

    cv::Size ssize = src.size();
    //int channels = src.channels();
    const int channels = 3;

    dst.create(dsize, src.type());

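    // Sizes are hard-coded so that the compiler sees them as constants.
    //const int src_height = ssize.height;
    //const int src_width = ssize.width;
    //const int dst_height = dsize.height;
    //const int dst_width = dsize.width;

    const int src_height = 1080;
    const int src_width = 1920;
    const int dst_height = 448;
    const int dst_width = 768;

    // Guard against inputs that do not match the hard-coded values.
    CV_Assert(src.channels() == channels);
    CV_Assert(ssize.width == src_width && ssize.height == src_height);
    CV_Assert(dsize.width == dst_width && dsize.height == dst_height);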

    const float scale_w = src_width * 1.0 / dst_width;
    const float scale_h = src_height * 1.0 / dst_height;

    std::vector<int> src_h_table(dst_height);
    std::vector<int> src_w_table(dst_width);

    for (int i = 0; i < dst_height; i++)
    {
        src_h_table[i] = static_cast<int>(i * scale_h);
        src_h_table[i] = clip(src_h_table[i], 0, src_height - 1);
    }
    for (int j = 0; j < dst_width; j++)
    {
        src_w_table[j] = static_cast<int>(j * scale_w);
        src_w_table[j] = clip(src_w_table[j], 0, src_width - 1) * channels;
    }

    for (int i = 0; i < dst_height; i++)
    {
        int src_i = src_h_table[i];
        uchar* dst_line = dst.ptr(i);
        const uchar* src_line = src.ptr(src_i);
        for (int j = 0; j < dst_width; j++)
        {
            uchar* dst_pixel = dst_line + j * channels;
            const uchar* src_pixel = src_line + src_w_table[j];
            for (int k = 0; k < channels; k++)
            {
                dst_pixel[k] = src_pixel[k];
            }
        }
    }
}

Time:
better::resize_nn_v5: took 3.067396 ms
better::resize_nn_v6: took 2.856667 ms
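Once the sizes are compile-time constants, one possible further step is to build the column table at compile time as well. This goes beyond what was measured in this post; the type OffsetTable and the constant kSrcColOffset are hypothetical, and the sketch assumes C++14 or later.

```cpp
#include <cassert>

// Column byte-offset table computed at compile time.
// Entry j holds clamp(int(j * SrcLen / DstLen)) * Channels.
template <int SrcLen, int DstLen, int Channels>
struct OffsetTable
{
    int offset[DstLen];

    constexpr OffsetTable() : offset()
    {
        const float scale = SrcLen * 1.0 / DstLen;   // same expression as scale_w
        for (int j = 0; j < DstLen; j++) {
            const int s = static_cast<int>(j * scale);
            offset[j] = (s < SrcLen - 1 ? s : SrcLen - 1) * Channels;
        }
    }
};

// Table for 1920 -> 768 columns with 3 channels, as in this post.
constexpr OffsetTable<1920, 768, 3> kSrcColOffset{};

static_assert(kSrcColOffset.offset[1] == 2 * 3, "column 1 maps to source column 2");
static_assert(kSrcColOffset.offset[767] == 1917 * 3, "last column maps to source column 1917");
```

Building the 768-entry table at run time is cheap compared with copying 768 x 448 pixels, so any gain from this is likely small.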

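4. Summary

1. Big vs. little cores can affect performance significantly, so optimization is still worthwhile

Even for a function as simple as nearest-neighbor resize, a quick test may report a very short time, for example 2 ms, and lead you to conclude that there is little left to optimize. In a more complex environment the thread may be scheduled onto a little core, where the time can be up to 10 times longer (in the tables above, the naive version takes 2.4 ms on the big core and 19.3 ms on the little core, about 8 times longer).

2. Mat.ptr(i, j) looks convenient, but calling it inside inner loops degrades performance substantially

3. Precomputing the mapped indices gives some speedup, though its effect is smaller than that of removing Mat.ptr(i, j)

4. With more parameters known at compile time, the code ends up faster than OpenCV even without hand-written SIMD.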
