Playing with YOLO on iOS

This article describes how to implement YOLOv2 on iOS devices.
Paper: https://arxiv.org/abs/1612.08242
GitHub: https://github.com/pjreddie/darknet
Website: https://pjreddie.com/darknet/yolo

Introduction to YOLOv2

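YOLOv2 takes a 416x416 input and passes it through a series of convolution, BN, and pooling operations, ending with a 13x13x125 feature map. The 13x13 corresponds to a 13x13 grid laid over the original image, as shown in the figure below.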

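The 125 comes from 5x(5+20): each cell predicts 5 bounding boxes (one per anchor), and each bounding box has x, y, w, h, a confidence score (the probability that the box contains an object), and probabilities for 20 classes (the PASCAL VOC dataset has 20 classes).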
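
To make the 5x(5+20) layout concrete, here is a minimal sketch of how a channel index could be computed. It assumes the 25 values of each box are stored consecutively (tx, ty, tw, th, tc, then the 20 class scores) and that the feature map has been copied into a flat [row][column][channel] array. Both are assumptions for illustration; an MPSImage stores channels in groups of four, so real code would index differently. The names `channelOffset` and `flatIndex` are hypothetical.

```swift
// Assumed layout: each anchor box owns 25 consecutive channels:
// tx, ty, tw, th, tc, followed by 20 class scores.
let numBoxesPerCell = 5
let numClasses = 20
let valuesPerBox = 5 + numClasses  // 25

/// First channel that belongs to anchor box `b` inside one grid cell.
func channelOffset(forBox b: Int) -> Int {
    return b * valuesPerBox
}

/// Index into a flat array ordered as [cy][cx][channel] (assumed order).
func flatIndex(cy: Int, cx: Int, channel: Int, gridSize: Int = 13) -> Int {
    let channels = numBoxesPerCell * valuesPerBox  // 125
    return (cy * gridSize + cx) * channels + channel
}
```

With this layout, the confidence value tc of box b in cell (cx, cy) would sit at channel `channelOffset(forBox: b) + 4`.
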
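The anchor values are obtained by running k-means clustering on the boxes of the training set. The distance used by k-means here is not the Euclidean distance but the following IOU-based score: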

d(box, centroid) = 1 - IOU(box, centroid)
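
The distance above can be sketched in a few lines. This is only an illustration, assuming each box is described by its width and height and that the two boxes are aligned at the same center before computing the IOU (a common convention for anchor clustering). The function names are hypothetical.

```swift
typealias Size = (w: Float, h: Float)

/// IOU of two boxes that share the same center (assumption), so only
/// their widths and heights matter.
func iou(_ a: Size, _ b: Size) -> Float {
    let intersection = min(a.w, b.w) * min(a.h, b.h)
    let union = a.w * a.h + b.w * b.h - intersection
    return union > 0 ? intersection / union : 0
}

/// d(box, centroid) = 1 - IOU(box, centroid)
func distance(box: Size, centroid: Size) -> Float {
    return 1 - iou(box, centroid)
}

/// Assignment step of k-means: index of the closest centroid.
func nearestCentroid(to box: Size, among centroids: [Size]) -> Int {
    var best = 0
    var bestDistance = Float.greatestFiniteMagnitude
    for (i, c) in centroids.enumerated() {
        let d = distance(box: box, centroid: c)
        if d < bestDistance {
            bestDistance = d
            best = i
        }
    }
    return best
}
```

Unlike the Euclidean distance, this score does not penalize large boxes more than small ones, because the IOU depends only on the relative overlap.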

How do we implement it on iOS?

Because model size and speed both matter when running on a phone, we use Tiny YOLO. The network structure is as follows:

Layer         kernel  stride  output shape
---------------------------------------------
Input                          (416, 416, 3)
Convolution    3×3      1      (416, 416, 16)
MaxPooling     2×2      2      (208, 208, 16)
Convolution    3×3      1      (208, 208, 32)
MaxPooling     2×2      2      (104, 104, 32)
Convolution    3×3      1      (104, 104, 64)
MaxPooling     2×2      2      (52, 52, 64)
Convolution    3×3      1      (52, 52, 128)
MaxPooling     2×2      2      (26, 26, 128)
Convolution    3×3      1      (26, 26, 256)
MaxPooling     2×2      2      (13, 13, 256)
Convolution    3×3      1      (13, 13, 512)
MaxPooling     2×2      1      (13, 13, 512)
Convolution    3×3      1      (13, 13, 1024)
Convolution    3×3      1      (13, 13, 1024)
Convolution    1×1      1      (13, 13, 125)
---------------------------------------------
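
The spatial sizes in the table can be checked with a short calculation. The sketch below assumes every layer pads its input so that the output side length is ceil(input / stride); the stride list is read off the table, and the helper names are hypothetical.

```swift
// Strides of the 15 layers in the table, top to bottom.
let tinyYoloStrides = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1]

/// Output side length with "same" padding (assumption): ceil(input / stride).
func outputSize(input: Int, stride: Int) -> Int {
    return (input + stride - 1) / stride
}

/// Applies every layer's stride in order and returns the final side length.
func finalGridSize(inputSize: Int, strides: [Int]) -> Int {
    return strides.reduce(inputSize) { outputSize(input: $0, stride: $1) }
}

let gridSize = finalGridSize(inputSize: 416, strides: tinyYoloStrides)
let pixelsPerCell = 416 / gridSize
print(gridSize, pixelsPerCell)  // prints "13 32"
```

The five stride-2 pooling layers reduce the input by a factor of 32, which is why each grid cell covers 32x32 pixels of the 416x416 input.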

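The whole network has only nine convolution layers. Note that the BN layers have already been removed from this inference network. Also, the last max-pooling layer uses a stride of 1, so it does not change the size of the feature map.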
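
The article does not say how the BN layers were removed. A common approach is to fold the BN parameters into the weights and bias of the preceding convolution, and the sketch below illustrates that idea for a single output channel. It assumes the convolution has no bias of its own before BN; the function name and the default epsilon are hypothetical.

```swift
/// Folds batch normalization into the convolution for one output channel.
/// BN computes gamma * (z - mean) / sqrt(variance + epsilon) + beta, where
/// z is the convolution output, so scaling the weights and using a new
/// bias gives the same result without a separate BN layer.
func foldBatchNorm(weights: [Float], gamma: Float, beta: Float,
                   mean: Float, variance: Float,
                   epsilon: Float = 1e-5) -> (weights: [Float], bias: Float) {
    let scale = gamma / (variance + epsilon).squareRoot()
    let foldedWeights = weights.map { $0 * scale }
    let foldedBias = beta - mean * scale
    return (foldedWeights, foldedBias)
}
```

Folding is done once, offline, when converting the trained weights, so the inference network only needs convolution layers with a bias.
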
The code that builds the YOLO network with Metal is as follows:

    public init(device: MTLDevice) {
        print("Setting up neural network...")
        let startTime = CACurrentMediaTime()
        
        self.device = device
        commandQueue = device.makeCommandQueue()
        
        conv9_img = MPSImage(device: device, imageDescriptor: conv9_id) //save the result
        
        lanczos = MPSImageLanczosScale(device: device)
        
        // Leaky ReLU: f(x) = x for x >= 0 and 0.1 * x otherwise.
        let relu = MPSCNNNeuronReLU(device: device, a: 0.1)
        
        
        conv1 = SlimMPSCNNConvolution(kernelWidth: 3,
                                         kernelHeight: 3,
                                         inputFeatureChannels: 3,
                                         outputFeatureChannels: 16,
                                         neuronFilter: relu,
                                         device: device,
                                         kernelParamsBinaryName: "conv1",
                                         padding: true,
                                         strideXY: (1,1))
        maxpooling1 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv2 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 16,
                                      outputFeatureChannels: 32,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv2",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling2 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv3 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 32,
                                      outputFeatureChannels: 64,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv3",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling3 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv4 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 64,
                                      outputFeatureChannels: 128,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv4",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling4 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv5 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 128,
                                      outputFeatureChannels: 256,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv5",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling5 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 2, strideInPixelsY: 2)
        conv6 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 256,
                                      outputFeatureChannels: 512,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv6",
                                      padding: true,
                                      strideXY: (1,1))
        maxpooling6 = MPSCNNPoolingMax(device: device, kernelWidth: 2, kernelHeight: 2, strideInPixelsX: 1, strideInPixelsY: 1)
        //offset setting is necessary to make sure 13x13->13x13 after pooling
        maxpooling6.offset = MPSOffset(x: 2, y: 2, z: 0)
        maxpooling6.edgeMode = MPSImageEdgeMode.clamp
        conv7 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 512,
                                      outputFeatureChannels: 1024,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv7",
                                      padding: true,
                                      strideXY: (1,1))
        conv8 = SlimMPSCNNConvolution(kernelWidth: 3,
                                      kernelHeight: 3,
                                      inputFeatureChannels: 1024,
                                      outputFeatureChannels: 1024,
                                      neuronFilter: relu,
                                      device: device,
                                      kernelParamsBinaryName: "conv8",
                                      padding: true,
                                      strideXY: (1,1))
        conv9 = SlimMPSCNNConvolution(kernelWidth: 1,
                                      kernelHeight: 1,
                                      inputFeatureChannels: 1024,
                                      outputFeatureChannels: 125,
                                      neuronFilter: nil,
                                      device: device,
                                      kernelParamsBinaryName: "conv9",
                                      padding: false,
                                      strideXY: (1,1))
        
        let endTime = CACurrentMediaTime()
        print("Elapsed time: \(endTime - startTime) sec")
    }

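After the network produces the 13x13x125 feature map, it has to be converted into the corresponding 5 bounding boxes per grid cell. For a cell at grid position (cx, cy) and an anchor of size (anchor_w, anchor_h), the conversion is:

x = (cx + sigmoid(tx)) * 32
y = (cy + sigmoid(ty)) * 32
w = anchor_w * exp(tw) * 32
h = anchor_h * exp(th) * 32
confidence = sigmoid(tc)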

The conversion is implemented as follows:

                // The predicted tx and ty coordinates are relative to the location
                // of the grid cell; we use the logistic sigmoid to constrain these
                // coordinates to the range 0 - 1. Then we add the cell coordinates
                // (0-12) and multiply by the number of pixels per grid cell (32).
                // Now x and y represent center of the bounding box in the original
                // 416x416 image space.
                let x = (Float(cx) + Math.sigmoid(tx)) * blockSize
                let y = (Float(cy) + Math.sigmoid(ty)) * blockSize
                
                // The size of the bounding box, tw and th, is predicted relative to
                // the size of an "anchor" box. Here we also transform the width and
                // height into the original 416x416 image space.
                let w = exp(tw) * anchors[2*b    ] * blockSize
                let h = exp(th) * anchors[2*b + 1] * blockSize
                
                // The confidence value for the bounding box is given by tc. We use
                // the logistic sigmoid to turn this into a value between 0 and 1.
                let confidence = Math.sigmoid(tc)
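
The fragment above relies on a `Math.sigmoid` helper that is not shown. The following self-contained sketch wraps the same arithmetic in one function so it can be tried in isolation. The names `sigmoid`, `DecodedBox`, and `decodeBox` are hypothetical, and the anchor size is passed in directly instead of being read from the `anchors` array.

```swift
import Foundation

/// Logistic sigmoid: maps any real number into the range 0...1.
func sigmoid(_ x: Float) -> Float {
    return 1 / (1 + exp(-x))
}

struct DecodedBox {
    var x, y, w, h, confidence: Float
}

/// Converts one raw prediction into a box in 416x416 image space.
/// cx, cy are the grid cell coordinates (0-12); anchorW, anchorH are
/// measured in grid cells; blockSize is the number of pixels per cell.
func decodeBox(tx: Float, ty: Float, tw: Float, th: Float, tc: Float,
               cx: Int, cy: Int, anchorW: Float, anchorH: Float,
               blockSize: Float = 32) -> DecodedBox {
    let x = (Float(cx) + sigmoid(tx)) * blockSize
    let y = (Float(cy) + sigmoid(ty)) * blockSize
    let w = exp(tw) * anchorW * blockSize
    let h = exp(th) * anchorH * blockSize
    return DecodedBox(x: x, y: y, w: w, h: h, confidence: sigmoid(tc))
}
```

For example, raw values of zero in cell (0, 0) give a box centered in the middle of that cell (x = y = 16) whose size equals the anchor size times 32 pixels.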

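After the conversion there are 13x13x5 = 845 boxes in total, and many of them are wrong, so only boxes with bestClassScore * confidence > 0.3 are kept. Many boxes still satisfy this condition after filtering, so non-maximum suppression (NMS) is needed as a final step to remove the redundant ones.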
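
The filtering step can be sketched as follows. It assumes the 20 raw class scores are first turned into probabilities with a softmax, which the article does not state explicitly, and the names `softmax`, `Candidate`, and `bestCandidate` are hypothetical.

```swift
import Foundation

/// Numerically stable softmax: subtract the maximum before exponentiating.
func softmax(_ x: [Float]) -> [Float] {
    guard let maxValue = x.max() else { return [] }
    let exps = x.map { exp($0 - maxValue) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

struct Candidate {
    let classIndex: Int
    let score: Float
}

/// Returns the best class and its combined score when
/// bestClassScore * confidence exceeds the threshold, otherwise nil.
func bestCandidate(classScores: [Float], confidence: Float,
                   threshold: Float = 0.3) -> Candidate? {
    let probabilities = softmax(classScores)
    guard let bestClassScore = probabilities.max(),
          let classIndex = probabilities.firstIndex(of: bestClassScore)
    else { return nil }
    let score = bestClassScore * confidence
    return score > threshold ? Candidate(classIndex: classIndex, score: score) : nil
}
```

Boxes for which this returns nil are dropped before NMS, which keeps the number of IOU comparisons small.
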
The NMS implementation is as follows:

/**
 Removes bounding boxes that overlap too much with other boxes that have
 a higher score.
 
 Based on code from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/non_max_suppression_op.cc
 
 - Parameters:
    - boxes: an array of bounding boxes and their scores
    - limit: the maximum number of boxes that will be selected
    - threshold: used to decide whether boxes overlap too much
 */
func nonMaxSuppression(boxes: [YOLO.Prediction], limit: Int, threshold: Float) -> [YOLO.Prediction] {
    
    // Do an argsort on the confidence scores, from high to low.
    let sortedIndices = boxes.indices.sorted { boxes[$0].score > boxes[$1].score }
    
    var selected: [YOLO.Prediction] = []
    var active = [Bool](repeating: true, count: boxes.count)
    var numActive = active.count
    
    // The algorithm is simple: Start with the box that has the highest score.
    // Remove any remaining boxes that overlap it more than the given threshold
    // amount. If there are any boxes left (i.e. these did not overlap with any
    // previous boxes), then repeat this procedure, until no more boxes remain
    // or the limit has been reached.
    outer: for i in 0..<boxes.count {
        if active[i] {
            let boxA = boxes[sortedIndices[i]]
            selected.append(boxA)
            if selected.count >= limit { break }
            
            for j in i+1..<boxes.count {
                if active[j] {
                    let boxB = boxes[sortedIndices[j]]
                    if IOU(a: boxA.rect, b: boxB.rect) > threshold {
                        active[j] = false
                        numActive -= 1
                        if numActive <= 0 { break outer }
                    }
                }
            }
        }
    }
    return selected
}
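
The `IOU(a:b:)` function called above is not shown in this article. A minimal sketch of what it might look like is given below, assuming axis-aligned boxes described by their top-left corner and size. The `Box` struct is a hypothetical stand-in for the type of `rect`.

```swift
/// Hypothetical box type: (x, y) is the top-left corner.
struct Box {
    var x, y, width, height: Float
}

/// Intersection-over-union of two axis-aligned boxes.
/// Returns 0 when either box is empty or the boxes do not overlap.
func IOU(a: Box, b: Box) -> Float {
    let areaA = a.width * a.height
    let areaB = b.width * b.height
    if areaA <= 0 || areaB <= 0 { return 0 }

    // Width and height of the overlapping region, clamped at zero.
    let overlapW = max(0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    let overlapH = max(0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    let intersection = overlapW * overlapH

    return intersection / (areaA + areaB - intersection)
}
```

Two identical boxes give an IOU of 1 and disjoint boxes give 0, so the NMS threshold lies somewhere in between.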