Summary of YOLO v3 Principles

Paper: YOLOv3: An Incremental Improvement

The main improvements in YOLOv3 are: an adjusted network architecture; object detection on multi-scale feature maps; and logistic (sigmoid) outputs replacing softmax for object classification.

The New Network Architecture: Darknet-53

For basic image feature extraction, YOLOv3 uses a network called Darknet-53 (it contains 53 convolutional layers). Borrowing from residual networks (ResNet), it adds shortcut connections between some layers.

[Figure: YOLOv3 network architecture, with the Darknet-53 backbone and three detection branches]
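
As a rough illustration of these shortcut connections, here is a minimal PyTorch sketch of a Darknet-style residual block (an illustrative sketch, not the verbatim Darknet-53 definition):

import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """A Darknet-style residual block: a 1x1 convolution halves the channels,
    a 3x3 convolution restores them, and a shortcut adds the input back."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.conv2(self.conv1(x))  # shortcut connection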

Object Detection with Multi-Scale Feature Maps

YOLOv2 used a passthrough layer to capture fine-grained features; YOLOv3 goes further and performs detection on feature maps at three different scales.

Referring to the architecture figure above: after layer 79, a few more convolutional layers (the yellow blocks in the figure) produce the detection result at the first scale. Relative to the input image, this feature map is downsampled 32x; for a 416*416 input, the feature map here is 13*13. Because of the high downsampling factor, its receptive field is large, making it suitable for detecting large objects.

For finer-grained detection, the layer-79 feature map is upsampled and then fused with the layer-61 feature map by concatenation, giving the finer-grained layer-91 feature map. After a few more convolutional layers, this likewise yields a detection feature map, downsampled 16x relative to the input. It has a medium-sized receptive field and is suitable for detecting medium-sized objects.

Finally, the layer-91 feature map is upsampled again and fused with the layer-36 feature map by concatenation, ultimately producing a feature map downsampled 8x relative to the input. It has the smallest receptive field and is suitable for detecting small objects.
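
The fusion step itself is just upsampling followed by channel-wise concatenation. A minimal PyTorch sketch (the channel counts are illustrative, not the exact ones used in YOLOv3):

import torch
import torch.nn.functional as F

coarse = torch.randn(1, 256, 13, 13)  # e.g. the 32x-downsampled map after layer 79
finer = torch.randn(1, 512, 26, 26)   # e.g. the 16x-downsampled map from layer 61

upsampled = F.interpolate(coarse, scale_factor=2, mode="nearest")  # 13x13 -> 26x26
fused = torch.cat([upsampled, finer], dim=1)  # concatenate along the channel axis
print(fused.shape)  # torch.Size([1, 768, 26, 26])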


Anchor Boxes at Nine Scales

As the number and scale of the output feature maps change, the anchor (prior) box sizes must be adjusted accordingly. YOLOv2 already used k-means clustering to determine anchor box sizes; YOLOv3 keeps this approach, assigning 3 anchor boxes to each downsampling scale, for 9 clustered anchor sizes in total. On the COCO dataset these 9 anchors are: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198), (373x326).

For the assignment: the smallest 13*13 feature map (largest receptive field) uses the largest anchors (116x90), (156x198), (373x326), suited to detecting larger objects. The medium 26*26 feature map (medium receptive field) uses the medium anchors (30x61), (62x45), (59x119), suited to medium-sized objects. The largest 52*52 feature map (smallest receptive field) uses the smallest anchors (10x13), (16x30), (33x23), suited to small objects.
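
In code form, the assignment can be written as a simple lookup (a sketch; the grouping follows the COCO anchors listed above):

# COCO anchors (width, height) in pixels, grouped by the grid size they serve
anchors_per_scale = {
    13: [(116, 90), (156, 198), (373, 326)],  # 32x downsampling, large objects
    26: [(30, 61), (62, 45), (59, 119)],      # 16x downsampling, medium objects
    52: [(10, 13), (16, 30), (33, 23)],       # 8x downsampling, small objects
}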

To get a feel for the nine anchor sizes, the figure below shows the clustered anchor boxes in blue. The yellow box is the ground truth, and the red box is the grid cell containing the object's center point.

[Figure: the nine clustered anchor boxes (blue), a ground-truth box (yellow), and the grid cell containing the object's center (red)]

Object Classification: Logistic Instead of Softmax

When predicting object classes, YOLOv3 does not use softmax; each class is instead predicted with an independent logistic (sigmoid) output. This supports multi-label objects (for example, a person can carry both the Woman and Person labels).
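
A minimal sketch of the difference: softmax makes the classes compete (the probabilities sum to 1), while independent sigmoids allow several classes to be active at once:

import torch

logits = torch.tensor([2.0, 1.5, -3.0])  # raw class scores for one predicted box

softmax_probs = torch.softmax(logits, dim=0)  # sum to 1: one label suppresses the others
sigmoid_probs = torch.sigmoid(logits)         # independent: two labels can both be high

print(softmax_probs)  # ~tensor([0.6199, 0.3760, 0.0042])
print(sigmoid_probs)  # ~tensor([0.8808, 0.8176, 0.0474])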


Mapping the Input to the Output

Let's count how many predictions YOLOv3 makes in total. For a 416*416 input image, 3 anchor boxes are placed at each grid cell of each scale's feature map, giving 13*13*3 + 26*26*3 + 52*52*3 = 10647 predictions. Each prediction is a (4+1+80) = 85-dimensional vector comprising the box coordinates (4 values), the box confidence (1 value), and the object class probabilities (80 classes for the COCO dataset).

For comparison, YOLOv2 made 13*13*5 = 845 predictions. YOLOv3 attempts more than ten times as many box predictions, and at multiple resolutions, so both mAP and small-object detection improve to some extent.
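
The arithmetic, spelled out:

n_boxes_v3 = 13 * 13 * 3 + 26 * 26 * 3 + 52 * 52 * 3  # 507 + 2028 + 8112 = 10647
n_boxes_v2 = 13 * 13 * 5                               # 845
print(n_boxes_v3 / n_boxes_v2)                         # ~12.6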


Data Processing

Train

import torch
import torch.nn as nn


class YOLOLayer(nn.Module):
    """Detection layer"""

    def __init__(self, anchors, num_classes, img_dim=416):
        super(YOLOLayer, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.ignore_thres = 0.5  # anchors whose IoU with a target exceeds this are not penalized as background
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCELoss()
        self.obj_scale = 1  # weight of the confidence loss at object locations
        self.noobj_scale = 100  # weight of the confidence loss at background locations
        self.metrics = {}
        self.img_dim = img_dim
        self.grid_size = 0  # cached grid size; 0 forces compute_grid_offsets on the first forward pass

    def compute_grid_offsets(self, grid_size, cuda=True):
        """
        Convert the anchor boxes to grid-cell units and compute the stride,
        i.e. how many pixels of the resized image each grid cell covers.
        :param grid_size: width/height of the feature map, in cells
        :param cuda: whether to allocate the tensors on the GPU
        :return:
        """
        self.grid_size = grid_size
        g = self.grid_size
        FloatTensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor
        self.stride = self.img_dim / self.grid_size  # pixels of the resized image per grid cell
        # Calculate offsets for each grid cell
        self.grid_x = torch.arange(g).repeat(g, 1).view([1, 1, g, g]).type(FloatTensor)  # [1, 1, grid_size, grid_size], used to compute the x coordinates of box centers
        self.grid_y = torch.arange(g).repeat(g, 1).t().view([1, 1, g, g]).type(FloatTensor)  # [1, 1, grid_size, grid_size], used to compute the y coordinates of box centers
        self.scaled_anchors = FloatTensor([(a_w / self.stride, a_h / self.stride) for a_w, a_h in self.anchors])  # anchor sizes converted from pixels to grid-cell units, [n_anchorbox, 2]
        self.anchor_w = self.scaled_anchors[:, 0:1].view((1, self.num_anchors, 1, 1))  # anchor widths, [1, n_anchorbox, 1, 1]
        self.anchor_h = self.scaled_anchors[:, 1:2].view((1, self.num_anchors, 1, 1))  # anchor heights, [1, n_anchorbox, 1, 1]

    def forward(self, x, targets=None, img_dim=None):

        # Tensors for cuda support
        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
        ByteTensor = torch.cuda.ByteTensor if x.is_cuda else torch.ByteTensor

        if img_dim is not None:  # keep the constructor's img_dim if none is passed in
            self.img_dim = img_dim
        num_samples = x.size(0)  # batch_size
        grid_size = x.size(2)  # grid_size*grid_size is the number of cells in this feature map

        prediction = (  # [batch_size, n_anchorbox, grid_size, grid_size, n_classes+5]
            x.view(num_samples, self.num_anchors, self.num_classes + 5, grid_size, grid_size)
            .permute(0, 1, 3, 4, 2)
            .contiguous()
        )

        # Raw network predictions: x, y, w, h, conf, cls
        x = torch.sigmoid(prediction[..., 0])  # x offset of the box center from the top-left corner of its cell, in grid-cell units, [batch_size, n_anchorbox, grid_size, grid_size]
        y = torch.sigmoid(prediction[..., 1])  # y offset of the box center from the top-left corner of its cell, in grid-cell units, [batch_size, n_anchorbox, grid_size, grid_size]
        w = prediction[..., 2]  # raw box width (log-space, relative to the anchor width), [batch_size, n_anchorbox, grid_size, grid_size]
        h = prediction[..., 3]  # raw box height (log-space, relative to the anchor height), [batch_size, n_anchorbox, grid_size, grid_size]
        pred_conf = torch.sigmoid(prediction[..., 4])  # objectness confidence, [batch_size, n_anchorbox, grid_size, grid_size]
        pred_cls = torch.sigmoid(prediction[..., 5:])  # per-class probabilities, [batch_size, n_anchorbox, grid_size, grid_size, n_classes]

        # If grid size does not match current we compute new offsets
        if grid_size != self.grid_size:
            self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

        # Add offset and scale with anchors
        pred_boxes = FloatTensor(prediction[..., :4].shape)  # [batch_size, n_anchorbox, grid_size, grid_size, 4], in grid-cell units
        pred_boxes[..., 0] = x.data + self.grid_x
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

        output = torch.cat(  # converted to pixel units, [batch_size, grid_size*grid_size*n_anchorbox, 5+n_classes]
            (
                pred_boxes.view(num_samples, -1, 4) * self.stride,
                pred_conf.view(num_samples, -1, 1),
                pred_cls.view(num_samples, -1, self.num_classes),
            ),
            -1,
        )

        if targets is None:
            return output, 0
        else:
            # iou_scores:  [batch_size, n_anchorbox, grid_size, grid_size], IoU between the predicted and target boxes
            # class_mask:  [batch_size, n_anchorbox, grid_size, grid_size], true where the predicted class is correct
            # obj_mask:    [batch_size, n_anchorbox, grid_size, grid_size]
            # noobj_mask:  [batch_size, n_anchorbox, grid_size, grid_size]
            # tx, ty, tw, th: [batch_size, n_anchorbox, grid_size, grid_size]
            # tcls:  [batch_size, n_anchorbox, grid_size, grid_size, n_classes]
            # tconf: [batch_size, n_anchorbox, grid_size, grid_size]
            iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
                pred_boxes=pred_boxes,
                pred_cls=pred_cls,
                target=targets,
                anchors=self.scaled_anchors,
                ignore_thres=self.ignore_thres,
            )

            # Loss: mask the outputs so non-existing objects are ignored (except in the confidence loss)
            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()  # classification accuracy at object locations
            conf_obj = pred_conf[obj_mask].mean()  # mean confidence where there is an object
            conf_noobj = pred_conf[noobj_mask].mean()  # mean confidence where there is no object
            conf50 = (pred_conf > 0.5).float()  # positions with confidence > 0.5, [batch_size, n_anchorbox, grid_size, grid_size]
            iou50 = (iou_scores > 0.5).float()  # positions with IoU > 0.5, [batch_size, n_anchorbox, grid_size, grid_size]
            iou75 = (iou_scores > 0.75).float()  # positions with IoU > 0.75, [batch_size, n_anchorbox, grid_size, grid_size]
            detected_mask = conf50 * class_mask * tconf  # confidence > 0.5, correct class, and obj_mask is true
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

            self.metrics = {
                "loss": to_cpu(total_loss).item(),
                "x": to_cpu(loss_x).item(),
                "y": to_cpu(loss_y).item(),
                "w": to_cpu(loss_w).item(),
                "h": to_cpu(loss_h).item(),
                "conf": to_cpu(loss_conf).item(),
                "cls": to_cpu(loss_cls).item(),
                "cls_acc": to_cpu(cls_acc).item(),
                "recall50": to_cpu(recall50).item(),
                "recall75": to_cpu(recall75).item(),
                "precision": to_cpu(precision).item(),
                "conf_obj": to_cpu(conf_obj).item(),
                "conf_noobj": to_cpu(conf_noobj).item(),
                "grid_size": grid_size,
            }

            return output, total_loss

def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    """
    :param pred_boxes: predicted boxes, in grid-cell units, [batch_size, n_anchorbox, grid_size, grid_size, 4]
    :param pred_cls: class probabilities, [batch_size, n_anchorbox, grid_size, grid_size, n_classes]
    :param target: [n_boxes, 6]; the six values per row are: index of the image the box belongs to, class index, x, y, w, h
    :param anchors: [n_anchorbox, 2]; the second dimension holds the anchor box width and height
    :param ignore_thres: IoU threshold above which an anchor is not penalized as background
    :return:
    """
    # Boolean masks (indexing with ByteTensor is deprecated in recent PyTorch)
    BoolTensor = torch.cuda.BoolTensor if pred_boxes.is_cuda else torch.BoolTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)  # batch_size
    nA = pred_boxes.size(1)  # n_anchorbox
    nC = pred_cls.size(-1)  # n_classes
    nG = pred_boxes.size(2)  # grid_size

    # Output tensors
    obj_mask = BoolTensor(nB, nA, nG, nG).fill_(False)  # [batch_size, n_anchorbox, grid_size, grid_size]
    noobj_mask = BoolTensor(nB, nA, nG, nG).fill_(True)  # [batch_size, n_anchorbox, grid_size, grid_size]
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size]
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size]
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size]
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size]
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size]
    th = FloatTensor(nB, nA, nG, nG).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size]
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size, n_classes]

    # Convert box coordinates to grid-cell units
    target_boxes = target[:, 2:6] * nG  # [n_boxes, 4]; the boxes' relative coordinates are converted to grid-cell units (n_boxes is the total number of boxes in the batch)
    gxy = target_boxes[:, :2]  # target box center coordinates, in grid-cell units, [n_target_box, 2]
    gwh = target_boxes[:, 2:]  # target box width and height, in grid-cell units, [n_target_box, 2]
    # Find the anchor box with the highest IoU with each target box
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])  # IoU between every anchor and every target box, [n_anchorbox, n_target_box]
    best_ious, best_n = ious.max(0)  # best IoU and the index of the best-matching anchor, each [n_target_box]
    # Separate target values
    b, target_labels = target[:, :2].long().t()  # image index within the batch and class index of each target box; [n_target_box], [n_target_box]
    gx, gy = gxy.t()  # target box center coordinates, in grid-cell units, each [n_target_box]
    gw, gh = gwh.t()  # target box width and height, in grid-cell units, each [n_target_box]
    gi, gj = gxy.long().t()  # gi is the cell index along x, gj the cell index along y, each [n_target_box]
    # Set masks
    obj_mask[b, best_n, gj, gi] = 1  # [batch_size, n_anchorbox, grid_size, grid_size]
    noobj_mask[b, best_n, gj, gi] = 0  # [batch_size, n_anchorbox, grid_size, grid_size]

    # Set noobj mask to zero where iou exceeds ignore threshold
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # Coordinates
    tx[b, best_n, gj, gi] = gx - gx.floor()  # x offset of the target box center from the top-left corner of its cell, [batch_size, n_anchorbox, grid_size, grid_size]
    ty[b, best_n, gj, gi] = gy - gy.floor()  # y offset of the target box center from the top-left corner of its cell, [batch_size, n_anchorbox, grid_size, grid_size]
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)  # target box width in log space relative to its anchor, [batch_size, n_anchorbox, grid_size, grid_size]
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)  # target box height in log space relative to its anchor, [batch_size, n_anchorbox, grid_size, grid_size]
    # One-hot encoding of the label
    tcls[b, best_n, gj, gi, target_labels] = 1  # [batch_size, n_anchorbox, grid_size, grid_size, n_classes]
    # Compute label correctness and IoU at the best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()  # [batch_size, n_anchorbox, grid_size, grid_size]
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)  # [batch_size, n_anchorbox, grid_size, grid_size]

    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
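
The code above relies on a few helpers (bbox_wh_iou, bbox_iou, to_cpu) that are not shown. The sketches below reconstruct what they are expected to do from how they are called here; they may differ in detail from the original repository's versions:

def bbox_wh_iou(wh1, wh2):
    # IoU computed from widths/heights only, as if both boxes shared the same center
    wh2 = wh2.t()
    w1, h1 = wh1[0], wh1[1]
    w2, h2 = wh2[0], wh2[1]
    inter_area = torch.min(w1, w2) * torch.min(h1, h2)
    union_area = (w1 * h1 + 1e-16) + w2 * h2 - inter_area
    return inter_area / union_area

def bbox_iou(box1, box2, x1y1x2y2=True):
    # Elementwise IoU of two box sets; x1y1x2y2=False means boxes are (cx, cy, w, h)
    if not x1y1x2y2:
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]
    inter_x1 = torch.max(b1_x1, b2_x1)
    inter_y1 = torch.max(b1_y1, b2_y1)
    inter_x2 = torch.min(b1_x2, b2_x2)
    inter_y2 = torch.min(b1_y2, b2_y2)
    inter_area = torch.clamp(inter_x2 - inter_x1, min=0) * torch.clamp(inter_y2 - inter_y1, min=0)
    b1_area = (b1_x2 - b1_x1) * (b1_y2 - b1_y1)
    b2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1)
    return inter_area / (b1_area + b2_area - inter_area + 1e-16)

def to_cpu(tensor):
    # Detach from the autograd graph and move to the CPU (used for logging metrics)
    return tensor.detach().cpu()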


The loss assembled in YOLOLayer.forward above has three parts: a coordinate loss (x, y, w, h), a confidence loss, and a classification loss.
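
For reference, the box parameterization used throughout is the one from the YOLOv3 paper:

    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * exp(tw)
    bh = ph * exp(th)

where (cx, cy) is the top-left corner of the responsible grid cell (grid_x, grid_y in the code) and (pw, ph) are the anchor's width and height (anchor_w, anchor_h). build_targets simply inverts these equations to produce the regression targets tx, ty, tw, th.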

After boxes are predicted, the following post-processing is applied:

1. Select the boxes whose confidence exceeds the threshold.

2. Compute each box's score: score = confidence * class probability.

3. Apply non-maximum suppression (NMS), as sketched in the code after this list:

  • Sort all boxes by score in descending order;
  • take the box with the highest score and compute its IoU with each remaining box;
  • if the IoU exceeds the threshold and the box's predicted label matches that of the highest-scoring box, mark it as an invalid box;
  • remove the invalid boxes from the candidate list;
  • using the invalid boxes' confidences as weights, refine the x, y, w, h of the highest-scoring box by a weighted average;
  • append the refined box to the result list;
  • repeat until no candidate boxes remain.
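
A sketch of this procedure for a single image (assumes torch and the bbox_iou helper above are in scope; rows of detections are (x1, y1, x2, y2, conf, class_score, class_pred) and have already passed the confidence threshold from step 1):

def nms_with_weighted_merge(detections, iou_thres=0.4):
    score = detections[:, 4] * detections[:, 5]  # step 2: confidence * class probability
    detections = detections[(-score).argsort()]  # sort by score, descending
    keep_boxes = []
    while detections.size(0):
        # IoU of the top-scoring box with every candidate (including itself)
        large_overlap = bbox_iou(detections[0, :4].unsqueeze(0), detections[:, :4]) > iou_thres
        label_match = detections[0, -1] == detections[:, -1]
        invalid = large_overlap & label_match  # overlapping boxes with the same label
        weights = detections[invalid, 4:5]     # use their confidences as weights
        # Weighted average of the invalid boxes refines the kept box's coordinates
        detections[0, :4] = (weights * detections[invalid, :4]).sum(0) / weights.sum()
        keep_boxes.append(detections[0])
        detections = detections[~invalid]      # drop the invalid boxes and repeat
    return torch.stack(keep_boxes) if keep_boxes else detections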

