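Paper: YOLOv3: An Incremental Improvement
The main improvements in YOLOv3 are: a restructured network; object detection on multi-scale features; and object classification that uses logistic outputs in place of softmax.
A new network structure: Darknet-53
For basic image feature extraction, YOLOv3 uses a network called Darknet-53 (it contains 53 convolutional layers). It borrows from residual networks by adding shortcut connections between some layers.
Object detection with multi-scale features
YOLOv2 used a passthrough structure to capture fine-grained features. YOLOv3 goes a step further and detects objects on feature maps at 3 different scales.
In the network diagram, after layer 79 the convolutional network passes through the few yellow convolutional layers at the bottom to produce the detection result at the first scale. Relative to the input image, the feature map used for detection here is downsampled by a factor of 32: for a 416*416 input, the feature map is 13*13. Because the downsampling factor is high, this feature map has a large receptive field, so it is suited to detecting large objects in the image.
To detect at a finer granularity, the layer-79 feature map is upsampled and then fused (concatenated) with the layer-61 feature map. This gives the finer-grained layer-91 feature map, which likewise passes through a few convolutional layers to produce a feature map downsampled by a factor of 16 relative to the input image. It has a medium receptive field and is suited to detecting medium-sized objects.
Finally, the layer-91 feature map is upsampled again and fused (concatenated) with the layer-36 feature map, giving a feature map downsampled by a factor of 8 relative to the input image. It has the smallest receptive field and is suited to detecting small objects.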
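The relation between input size, downsampling factor, and feature-map size can be checked with a few lines of Python. This is only a sketch: the function name `grid_sizes` is made up for illustration, and it assumes the input side length is a multiple of every stride.

```python
# Downsampling factors of the three detection scales described above.
STRIDES = (32, 16, 8)


def grid_sizes(input_size, strides=STRIDES):
    """Return the feature-map side length for each detection scale."""
    sizes = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(f"{input_size} is not a multiple of {stride}")
        sizes.append(input_size // stride)
    return sizes


print(grid_sizes(416))  # [13, 26, 52]
```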
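9 prior box (anchor box) sizes
As the number and scale of the output feature maps change, the prior box sizes have to be adjusted accordingly. YOLOv2 already used K-means clustering to obtain the prior box sizes. YOLOv3 continues with this approach and assigns 3 prior boxes to each downsampling scale, so 9 prior box sizes are clustered in total. On the COCO dataset these 9 prior boxes are: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198), (373x326).
They are assigned as follows. The smallest feature map, 13*13 (largest receptive field), uses the larger prior boxes (116x90), (156x198), (373x326), which suit larger objects. The medium feature map, 26*26 (medium receptive field), uses the medium prior boxes (30x61), (62x45), (59x119), which suit medium-sized objects. The largest feature map, 52*52 (smallest receptive field), uses the smaller prior boxes (10x13), (16x30), (33x23), which suit smaller objects.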
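The assignment above can be written down as a small lookup. The sketch below assumes the anchors are listed from small to large and split into consecutive groups of three; the names `anchors_by_stride` and `to_grid_units` are hypothetical. Dividing by the stride mirrors what `compute_grid_offsets` does in the code later in this post.

```python
# The 9 COCO prior boxes quoted above, as (width, height) in pixels.
COCO_ANCHORS = [
    (10, 13), (16, 30), (33, 23),
    (30, 61), (62, 45), (59, 119),
    (116, 90), (156, 198), (373, 326),
]


def anchors_by_stride(anchors=COCO_ANCHORS, strides=(8, 16, 32)):
    """Split anchors (sorted small to large) into one group per stride."""
    per_scale = len(anchors) // len(strides)
    return {
        stride: anchors[i * per_scale:(i + 1) * per_scale]
        for i, stride in enumerate(strides)
    }


def to_grid_units(anchors, stride):
    """Convert anchor sizes from pixels to grid-cell units."""
    return [(w / stride, h / stride) for w, h in anchors]


groups = anchors_by_stride()
print(groups[32])                     # anchors for the 13*13 map
print(to_grid_units(groups[32], 32))  # the same anchors in cell units
```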
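To get a feel for the 9 prior box sizes: in the figure, the blue boxes are the prior boxes obtained by clustering, the yellow box is the ground truth, and the red box is the grid cell that contains the object's center.
Object classification: softmax replaced by logistic
When predicting object classes, YOLOv3 does not use softmax. It predicts each class with an independent logistic (sigmoid) output instead. Softmax forces the class probabilities to sum to 1, so it implicitly assumes one class per box; independent logistic outputs support multi-label objects (for example, a person can carry both the Woman and Person labels).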
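A small numeric comparison makes the difference concrete. The logits and the three class names below are made-up values for illustration, not taken from any trained model.

```python
import math


def softmax(logits):
    """Softmax: the outputs compete and always sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent logistic outputs: each class is scored on its own."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


# Hypothetical raw scores for the classes Person, Woman, Car.
logits = [4.0, 4.0, -3.0]

print(softmax(logits))  # Person and Woman split the probability mass
print(sigmoid(logits))  # Person and Woman can both be close to 1
```

With a 0.5 threshold, the sigmoid outputs keep both Person and Woman, while no softmax output reaches 0.5 because the two labels share the probability mass.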
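Mapping input to output
Let us count how many predictions YOLOv3 makes. For a 416*416 input image, every grid cell of the feature map at each scale has 3 prior boxes, which gives 13*13*3 + 26*26*3 + 52*52*3 = 10647 predictions in total. Each prediction is a (4+1+80) = 85-dimensional vector that contains the box coordinates (4 values), the box confidence (1 value), and the object class probabilities (80 classes for the COCO dataset).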
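The arithmetic above can be reproduced directly. This is a minimal sketch; the function names are hypothetical and the defaults follow the numbers in the text.

```python
def prediction_count(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Total number of predicted boxes over all detection scales."""
    return sum((input_size // s) ** 2 * anchors_per_cell for s in strides)


def vector_length(num_classes=80):
    """Length of one prediction: 4 box values + 1 confidence + class scores."""
    return 4 + 1 + num_classes


print(prediction_count())  # 10647
print(vector_length())     # 85
```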
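By comparison, YOLOv2 makes 13*13*5 = 845 predictions. YOLOv3 attempts more than 10 times as many bounding boxes (10647 / 845 is about 12.6), and it does so at different resolutions, so mAP and the detection of small objects both improve to some degree.
Data processing
Train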
import torch
import torch.nn as nn

# build_targets is defined further below. bbox_wh_iou, bbox_iou and to_cpu are
# helper functions that this post does not show.


class YOLOLayer(nn.Module):
    """Detection layer"""

    def __init__(self, anchors, num_classes, img_dim=416):
        super(YOLOLayer, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.ignore_thres = 0.5
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCELoss()
        self.obj_scale = 1
        self.noobj_scale = 100
        self.metrics = {}
        self.img_dim = img_dim
        self.grid_size = 0  # grid size

    def compute_grid_offsets(self, grid_size, cuda=True):
        """
        Convert the anchor boxes to grid-cell units and compute the number of pixels per grid cell.
        :param grid_size: side length of the feature map, in cells
        :param cuda: whether to create the tensors on the GPU
        :return: None, the results are stored on self
        """
        self.grid_size = grid_size
        g = self.grid_size
        FloatTensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor
        self.stride = self.img_dim / self.grid_size  # pixels of the resized image covered by one grid cell
        # Calculate offsets for each grid
        # [1, 1, grid_size, grid_size], used to compute the x coordinate of the box center
        self.grid_x = torch.arange(g).repeat(g, 1).view([1, 1, g, g]).type(FloatTensor)
        # [1, 1, grid_size, grid_size], used to compute the y coordinate of the box center
        self.grid_y = torch.arange(g).repeat(g, 1).t().view([1, 1, g, g]).type(FloatTensor)
        # Anchor box sizes converted from pixels to grid-cell units, [n_anchorbox, 2]
        self.scaled_anchors = FloatTensor([(a_w / self.stride, a_h / self.stride) for a_w, a_h in self.anchors])
        self.anchor_w = self.scaled_anchors[:, 0:1].view((1, self.num_anchors, 1, 1))  # anchor box width, [1, n_anchorbox, 1, 1]
        self.anchor_h = self.scaled_anchors[:, 1:2].view((1, self.num_anchors, 1, 1))  # anchor box height, [1, n_anchorbox, 1, 1]

    def forward(self, x, targets=None, img_dim=None):
        # Tensors for cuda support
        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor

        # Keep the constructor value when img_dim is not passed; the original
        # assignment overwrote it with None and broke the stride computation.
        if img_dim is not None:
            self.img_dim = img_dim
        num_samples = x.size(0)  # batch_size
        grid_size = x.size(2)  # scale: the feature map has grid_size*grid_size cells

        prediction = (  # [batch_size, n_anchorbox, grid_size, grid_size, n_classes+5]
            x.view(num_samples, self.num_anchors, self.num_classes + 5, grid_size, grid_size)
            .permute(0, 1, 3, 4, 2)
            .contiguous()
        )

        # Network outputs x, y, w, h, conf, cls.
        # x, y, w, h, pred_conf: [batch_size, n_anchorbox, grid_size, grid_size]
        x = torch.sigmoid(prediction[..., 0])  # x offset of the center from the top-left corner of its cell, in cell units
        y = torch.sigmoid(prediction[..., 1])  # y offset of the center from the top-left corner of its cell, in cell units
        w = prediction[..., 2]  # raw width output; the box width is exp(w) * anchor_w, in cell units
        h = prediction[..., 3]  # raw height output; the box height is exp(h) * anchor_h, in cell units
        pred_conf = torch.sigmoid(prediction[..., 4])  # objectness confidence
        pred_cls = torch.sigmoid(prediction[..., 5:])  # class probabilities, [batch_size, n_anchorbox, grid_size, grid_size, n_classes]

        # If grid size does not match current we compute new offsets
        if grid_size != self.grid_size:
            self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

        # Add offset and scale with anchors
        pred_boxes = FloatTensor(prediction[..., :4].shape)  # [batch_size, n_anchorbox, grid_size, grid_size, 4], in cell units
        pred_boxes[..., 0] = x.data + self.grid_x
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

        output = torch.cat(  # boxes converted to pixels, [batch_size, grid_size*grid_size*n_anchorbox, 5+n_classes]
            (
                pred_boxes.view(num_samples, -1, 4) * self.stride,
                pred_conf.view(num_samples, -1, 1),
                pred_cls.view(num_samples, -1, self.num_classes),
            ),
            -1,
        )

        if targets is None:
            return output, 0
        else:
            # All tensors below are [batch_size, n_anchorbox, grid_size, grid_size],
            # except tcls, which has an extra trailing n_classes dimension.
            # iou_scores: IoU between the predicted box and the target box
            # class_mask: 1 where the predicted class is correct
            # obj_mask: true where an anchor is responsible for a target box
            # noobj_mask: true where the anchor is treated as background
            # tx, ty, tw, th: regression targets
            # tcls: one-hot class targets
            # tconf: confidence targets
            iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
                pred_boxes=pred_boxes,
                pred_cls=pred_cls,
                target=targets,
                anchors=self.scaled_anchors,
                ignore_thres=self.ignore_thres,
            )

            # Loss : Mask outputs to ignore non-existing objects (except with conf. loss)
            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()  # classification accuracy
            conf_obj = pred_conf[obj_mask].mean()  # mean confidence where there is an object
            conf_noobj = pred_conf[noobj_mask].mean()  # mean confidence where there is no object
            conf50 = (pred_conf > 0.5).float()  # positions with confidence above 0.5
            iou50 = (iou_scores > 0.5).float()  # positions with IoU above 0.5
            iou75 = (iou_scores > 0.75).float()  # positions with IoU above 0.75
            detected_mask = conf50 * class_mask * tconf  # confidence above 0.5, class correct, and obj_mask true
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

            self.metrics = {
                "loss": to_cpu(total_loss).item(),
                "x": to_cpu(loss_x).item(),
                "y": to_cpu(loss_y).item(),
                "w": to_cpu(loss_w).item(),
                "h": to_cpu(loss_h).item(),
                "conf": to_cpu(loss_conf).item(),
                "cls": to_cpu(loss_cls).item(),
                "cls_acc": to_cpu(cls_acc).item(),
                "recall50": to_cpu(recall50).item(),
                "recall75": to_cpu(recall75).item(),
                "precision": to_cpu(precision).item(),
                "conf_obj": to_cpu(conf_obj).item(),
                "conf_noobj": to_cpu(conf_noobj).item(),
                "grid_size": grid_size,
            }

            return output, total_loss


def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    """
    :param pred_boxes: predicted boxes in cell units, [batch_size, n_anchorbox, grid_size, grid_size, 4]
    :param pred_cls: class probabilities, [batch_size, n_anchorbox, grid_size, grid_size, n_classes]
    :param target: [n_boxes, 6]; the 6 values are: index of the image the box belongs to, class index, x, y, w, h
    :param anchors: [n_anchorbox, 2]; the second dimension holds the anchor box width and height
    :param ignore_thres: anchors whose IoU with a target exceeds this value are not penalized as background
    :return: iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
    """
    # Boolean masks: indexing with uint8 (ByteTensor) masks is deprecated in recent PyTorch.
    BoolTensor = torch.cuda.BoolTensor if pred_boxes.is_cuda else torch.BoolTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)  # batch_size
    nA = pred_boxes.size(1)  # n_anchorbox
    nC = pred_cls.size(-1)  # n_classes
    nG = pred_boxes.size(2)  # grid_size

    # Output tensors, all [batch_size, n_anchorbox, grid_size, grid_size] unless noted
    obj_mask = BoolTensor(nB, nA, nG, nG).fill_(0)
    noobj_mask = BoolTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)  # [batch_size, n_anchorbox, grid_size, grid_size, n_classes]

    # Convert to position relative to box
    target_boxes = target[:, 2:6] * nG  # [n_boxes, 4], relative coordinates converted to cell units; n_boxes is the number of boxes in the batch
    gxy = target_boxes[:, :2]  # target box centers in cell units, [n_target_box, 2]
    gwh = target_boxes[:, 2:]  # target box width and height in cell units, [n_target_box, 2]
    # Find the anchor box with the highest IoU for each target box
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])  # IoU of every anchor with every target, [n_anchorbox, n_target_box]
    best_ious, best_n = ious.max(0)  # best IoU and index of the best anchor for each target, both [n_target_box]
    # Separate target values (each is [n_target_box])
    b, target_labels = target[:, :2].long().t()  # image index within the batch, class index
    gx, gy = gxy.t()  # target box center in cell units
    gw, gh = gwh.t()  # target box width and height in cell units
    gi, gj = gxy.long().t()  # gi: cell index along x, gj: cell index along y
    # Set masks
    obj_mask[b, best_n, gj, gi] = 1
    noobj_mask[b, best_n, gj, gi] = 0

    # Set noobj mask to zero where iou exceeds ignore threshold
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # Coordinates: offset of the target center from the top-left corner of its cell
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height: log of the ratio between the target size and the matched anchor size
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # One-hot encoding of label
    tcls[b, best_n, gj, gi, target_labels] = 1
    # Compute label correctness and iou at best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)

    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf

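The forward method of YOLOLayer above defines the loss function. The loss has three parts: coordinate loss, confidence loss, and class loss.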
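Read off the code above, the three parts can be written as follows. The notation is introduced here for illustration and does not come from the paper: MSE and BCE are the mean squared error and binary cross-entropy used in the code, x, y, w, h are the network outputs, t are the targets built by build_targets, c_obj and c_noobj are the predicted confidences at the positions selected by obj_mask and noobj_mask, and p are the predicted class probabilities. The coordinate and class terms are evaluated only at obj_mask positions.

```latex
\begin{aligned}
\mathcal{L} &= \mathcal{L}_{\text{coord}} + \mathcal{L}_{\text{conf}} + \mathcal{L}_{\text{cls}} \\
\mathcal{L}_{\text{coord}} &= \mathrm{MSE}(x, t_x) + \mathrm{MSE}(y, t_y) + \mathrm{MSE}(w, t_w) + \mathrm{MSE}(h, t_h) \\
\mathcal{L}_{\text{conf}} &= \lambda_{\text{obj}}\,\mathrm{BCE}(c_{\text{obj}}, 1) + \lambda_{\text{noobj}}\,\mathrm{BCE}(c_{\text{noobj}}, 0) \\
\mathcal{L}_{\text{cls}} &= \mathrm{BCE}(p, t_{\text{cls}})
\end{aligned}
```

In the code, \lambda_{\text{obj}} is obj_scale = 1 and \lambda_{\text{noobj}} is noobj_scale = 100.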
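Post-processing after the boxes have been predicted:
1. Select the boxes whose confidence exceeds the threshold.
2. Compute the score of each box: score = confidence * class probability.
3. Apply non-maximum suppression (NMS):
- Sort all boxes by score in descending order;
- Take the box with the highest score and compute its IoU with the remaining boxes;
- If the IoU exceeds the threshold and the predicted label is the same as that of the highest-scoring box, mark the box as invalid;
- Remove the invalid boxes from the candidates;
- Using the confidences of the invalid boxes as weights, correct the x, y, w, h of the highest-scoring box by weighted averaging;
- Add the corrected box to the result list;
- Repeat until no candidate boxes remain.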
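The post-processing steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used for this post: boxes are given as corner coordinates (x1, y1, x2, y2) instead of x, y, w, h, the highest-scoring box takes part in the weighted average together with the boxes it suppresses, confidences are positive, and the names `iou` and `weighted_nms` as well as the default thresholds are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-16)


def weighted_nms(detections, conf_thres=0.5, nms_thres=0.4):
    """detections: dicts with keys 'box', 'conf', 'cls_prob', 'label'."""
    # Step 1: keep only boxes whose confidence reaches the threshold.
    candidates = [dict(d) for d in detections if d["conf"] >= conf_thres]
    # Step 2: score = confidence * class probability.
    for d in candidates:
        d["score"] = d["conf"] * d["cls_prob"]
    # Step 3: NMS, starting from the highest score.
    candidates.sort(key=lambda d: d["score"], reverse=True)
    kept = []
    while candidates:
        best = candidates[0]
        group, rest = [best], []
        for d in candidates[1:]:
            same_label = d["label"] == best["label"]
            if same_label and iou(best["box"], d["box"]) > nms_thres:
                group.append(d)  # invalid box: overlaps best, same label
            else:
                rest.append(d)
        # Confidence-weighted average of the coordinates in the group.
        total = sum(d["conf"] for d in group)
        merged = tuple(
            sum(d["conf"] * d["box"][k] for d in group) / total
            for k in range(4)
        )
        kept.append({**best, "box": merged})
        candidates = rest  # invalid boxes are dropped from the candidates
    return kept


detections = [
    {"box": (0, 0, 10, 10), "conf": 0.9, "cls_prob": 0.9, "label": 0},
    {"box": (1, 1, 11, 11), "conf": 0.6, "cls_prob": 0.8, "label": 0},
    {"box": (50, 50, 60, 60), "conf": 0.8, "cls_prob": 0.9, "label": 1},
    {"box": (0, 0, 10, 10), "conf": 0.3, "cls_prob": 0.9, "label": 0},
]
for det in weighted_nms(detections):
    print(det["label"], det["box"])
```

In this example the last detection is dropped by the confidence threshold, the first two are merged into one box for label 0, and the label 1 box is kept unchanged.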