YOLO-Based Face Detection and Face Alignment

Original

阳光玻璃杯

2020-06-20 16:26

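Introduction

YOLO (You Only Look Once) is an object recognition and localization algorithm based on deep neural networks. It treats object localization as a regression problem and performs localization and recognition in a single stage. Its defining feature is speed.

Since YOLO already localizes objects by regression while classifying them at the same time, it is natural to have it regress an object's landmark points in the same pass. The most common use case is doing face detection and face alignment together.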

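MTCNN already does something similar, performing face detection and face alignment together:

As the figure shows, MTCNN uses three convolutional neural networks to do face detection and face alignment. With YOLO, we will use a single convolutional neural network to do both:

We will remove YOLO's classification logic and add logic that regresses the landmark points.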

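To make the regression easier, landmark coordinates measured from the top-left corner of the image are converted into coordinates relative to the center of the bounding box.

tx = (x - center_x)*2/w

ty = (y - center_y)*2/h

tx, ty are the resulting coordinates relative to the center of the predicted box.

center_x, center_y are the center coordinates of the predicted box.

w, h are the width and height of the predicted box.

x, y are the landmark coordinates in the image frame. In the training code these are the ground-truth landmarks, and the network is trained to output tx, ty directly.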
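The conversion above, and its inverse for mapping a prediction back into the image, can be sketched as follows. This is only a minimal sketch: `box_t`, `encode_landmark` and `decode_landmark` are hypothetical names introduced for illustration, and the box is assumed to be stored as center plus width and height.

```c
/* Hypothetical box type for this sketch: center (x, y), width w, height h. */
typedef struct { float x, y, w, h; } box_t;

/* Image-frame point -> offsets relative to the box center.
 * The result lies in [-1, 1] when the point is inside the box. */
static void encode_landmark(box_t b, float px, float py, float *tx, float *ty)
{
    *tx = 2.0f * (px - b.x) / b.w;
    *ty = 2.0f * (py - b.y) / b.h;
}

/* Inverse mapping: box-centered offsets -> image-frame point. */
static void decode_landmark(box_t b, float tx, float ty, float *px, float *py)
{
    *px = b.x + tx * b.w / 2.0f;
    *py = b.y + ty * b.h / 2.0f;
}
```

For a box centered at (0.5, 0.5) with width 0.4, a point on the right edge (x = 0.7) encodes to tx = 1, and decoding that value gives back x = 0.7.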

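After this transform, tx and ty must lie in [-1, 1], so the network output has to be restricted to that range. Passing the raw output through a tanh activation is enough.

pred_tx = tanh(out_x)

pred_ty = tanh(out_y)

pred_tx, pred_ty are the network's predictions of tx, ty; out_x, out_y are the raw, pre-activation outputs of the network.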
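The range argument can be written out as a short derivation. The symbols are assumptions for this note only: c_x stands for center_x, o for a raw network output, and the landmark is assumed to lie inside the box.

```latex
% A landmark inside the box satisfies the left-hand bounds,
% which is exactly the condition for t_x to lie in [-1, 1].
c_x - \tfrac{w}{2} \le x \le c_x + \tfrac{w}{2}
\;\Longrightarrow\;
-1 \le t_x = \frac{2\,(x - c_x)}{w} \le 1

% tanh maps any real raw output into the open interval (-1, 1),
% and its derivative can be computed from the activated value itself.
\tanh(o) = \frac{e^{o} - e^{-o}}{e^{o} + e^{-o}} \in (-1, 1), \qquad
\frac{d}{do}\tanh(o) = 1 - \tanh^{2}(o)
```

Because tanh never reaches exactly -1 or 1, a landmark sitting exactly on the box edge can only be approached, not hit exactly. The same argument holds for t_y with c_y and h.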

The code is as follows:

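// Writes the landmark gradients for one box into delta and returns the accumulated loss.
// Channel k of this box is stored at index + k*stride. w and h are not used here.
static float delta_face_landmars(box fbox, float *truth, float *pred, float *delta, int index, int w, int h, int stride, float scale)
{
    int i;
    float diff = 0;
    float data[10];
    // five_point() is not shown in this post; it is expected to fill data with five (x, y) landmark pairs
    five_point(truth,data);
    for(i=0;i<5;++i){
        // target offsets relative to the box center, in [-1,1] when the point is inside the box
        float tx = 2*(data[i*2] - fbox.x)/fbox.w;
        float ty = 2*(data[i*2+1] - fbox.y)/fbox.h;
        //printf("tx=%f,ty=%f,box: %f,%f,%f,%f,truth: %f,%f\n",tx,ty,fbox.x,fbox.y,fbox.w,fbox.h,truth[i*2],truth[i*2+1]);
        // gradient of the squared error: scale * (target - prediction)
        delta[index + (i*2)*stride] = scale*(tx - pred[index + (i*2)*stride]);
        delta[index + (i*2+1)*stride] = scale*(ty - pred[index + (i*2+1)*stride]);
        // accumulate half the squared (scaled) error for x and y
        diff += 0.5*delta[index + (i*2)*stride]*delta[index + (i*2)*stride] + 0.5*delta[index + (i*2+1)*stride]*delta[index + (i*2+1)*stride];
    }
    return diff;
}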
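The `index + k*stride` addressing in the function above suggests a channel-major buffer, where the values of one channel for all grid cells are stored contiguously. The sketch below shows how such an offset could be computed. `chw_offset` is a hypothetical helper, and the channel-major layout itself is an assumption read off the indexing pattern, not something the post states.

```c
#include <stddef.h>

/* Offset of channel c at grid cell (row, col) in a channel-major buffer
 * holding grid_w x grid_h cells per channel (assumed layout).
 * Consecutive channels of the same cell are grid_w * grid_h floats apart. */
static size_t chw_offset(int c, int row, int col, int grid_w, int grid_h)
{
    size_t stride = (size_t)grid_w * (size_t)grid_h;
    return (size_t)c * stride + (size_t)row * (size_t)grid_w + (size_t)col;
}
```

On a 7 x 7 grid the stride is 49. If the landmark channels start at channel 5, after 4 box values and 1 confidence value (again an assumption), then `index` for cell (3, 2) would be `chw_offset(5, 3, 2, 7, 7)` and `stride` would be 49.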

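With this groundwork in place, we can start training the network.

Model selection

YOLOv2 is too large for my computer to run at all. YOLO tiny does run, but I did not choose it. Instead I used the 13 convolutional layers of VGG16 for feature extraction: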

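The three fully connected layers are replaced by a 1*1 convolution that produces n*(4+1+10) channels, where n is the number of boxes predicted per cell. The 10 corresponds to the x and y coordinates of the five landmarks. The model below has 15 output channels, so n = 1.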
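As a quick check of the channel count, here is a minimal sketch. `output_channels` is a hypothetical helper, and reading the "1" as a confidence score is my interpretation of the 4+1+10 split.

```c
/* Channels of the final 1x1 convolution.
 * Assumed per-box layout: 4 box coordinates + 1 confidence + 2 values per landmark. */
static int output_channels(int boxes_per_cell, int num_landmarks)
{
    return boxes_per_cell * (4 + 1 + 2 * num_landmarks);
}
```

With one box per cell and five landmarks this gives 15, which matches the filter count of the last convolution in the model listing.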

The final model is as follows:

layer     filters    size              input                output
    0 conv     64  3 x 3 / 1   224 x 224 x   3   ->   224 x 224 x  64  0.173 BFLOPs
    1 conv     64  3 x 3 / 1   224 x 224 x  64   ->   224 x 224 x  64  3.699 BFLOPs
    2 max          2 x 2 / 2   224 x 224 x  64   ->   112 x 112 x  64
    3 conv    128  3 x 3 / 1   112 x 112 x  64   ->   112 x 112 x 128  1.850 BFLOPs
    4 conv    128  3 x 3 / 1   112 x 112 x 128   ->   112 x 112 x 128  3.699 BFLOPs
    5 max          2 x 2 / 2   112 x 112 x 128   ->    56 x  56 x 128
    6 conv    256  3 x 3 / 1    56 x  56 x 128   ->    56 x  56 x 256  1.850 BFLOPs
    7 conv    256  3 x 3 / 1    56 x  56 x 256   ->    56 x  56 x 256  3.699 BFLOPs
    8 conv    256  3 x 3 / 1    56 x  56 x 256   ->    56 x  56 x 256  3.699 BFLOPs
    9 max          2 x 2 / 2    56 x  56 x 256   ->    28 x  28 x 256
   10 conv    512  3 x 3 / 1    28 x  28 x 256   ->    28 x  28 x 512  1.850 BFLOPs
   11 conv    512  3 x 3 / 1    28 x  28 x 512   ->    28 x  28 x 512  3.699 BFLOPs
   12 conv    512  3 x 3 / 1    28 x  28 x 512   ->    28 x  28 x 512  3.699 BFLOPs
   13 max          2 x 2 / 2    28 x  28 x 512   ->    14 x  14 x 512
   14 conv    512  3 x 3 / 1    14 x  14 x 512   ->    14 x  14 x 512  0.925 BFLOPs
   15 conv    512  3 x 3 / 1    14 x  14 x 512   ->    14 x  14 x 512  0.925 BFLOPs
   16 conv    512  3 x 3 / 1    14 x  14 x 512   ->    14 x  14 x 512  0.925 BFLOPs
   17 max          2 x 2 / 2    14 x  14 x 512   ->     7 x   7 x 512
   18 conv     15  1 x 1 / 1     7 x   7 x 512   ->     7 x   7 x  15  0.001 BFLOPs
   19 face_aliment
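The BFLOPs column in the listing can be reproduced by counting each multiply-accumulate of a convolution as two operations. The function below is a sketch that agrees with the numbers in the table. `conv_bflops` is a hypothetical name, not a function taken from the training code.

```c
/* Billions of floating-point operations for one convolution layer,
 * counting each multiply-accumulate as 2 operations.
 * out_w and out_h are the spatial size of the layer's output. */
static double conv_bflops(int filters, int ksize, int in_channels, int out_w, int out_h)
{
    return 2.0 * filters * ksize * ksize * in_channels * out_w * out_h / 1e9;
}
```

For layer 0 (64 filters, 3 x 3 kernel, 3 input channels, 224 x 224 output) this gives about 0.173, and for layer 1 (64 input channels) about 3.699. The final 1 x 1 layer comes out at roughly 0.00075, which the listing rounds to 0.001.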