Smart Object Detection 35: Building a YoloV4-Tiny Object Detection Platform in PyTorch
Preface
A PyTorch version is available as well.
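What is YOLOV4-Tiny
YOLOV4 is an improved version of YOLOV3 that combines a large number of small tricks on top of YOLOV3.
Although it brings no revolutionary change to object detection, YOLOV4 still strikes a very good balance between speed and accuracy.
As the figure above shows, YOLOV4 reaches an mAP of about 44 with no drop in FPS relative to YOLOV3, which is a very noticeable improvement.
The overall detection pipeline of YOLOV4 differs little from YOLOV3: both use three feature layers for classification and regression prediction.
YoloV4-Tiny is a simplified version of YoloV4. Some structures are removed, but it is much faster: YoloV4 has about 60 million parameters, whereas YoloV4-Tiny has only about 6 million.
YoloV4-Tiny uses only two feature layers for classification and regression prediction.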
Code download
https://github.com/bubbliiiing/yolov4-tiny-pytorch
If you like it, feel free to give it a star!
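YoloV4-Tiny structure analysis
1. Backbone feature extraction network
When the input is 416x416, the feature structure is as follows:
When the input is 608x608, the feature structure is as follows:
YoloV4-Tiny uses CSPdarknet53_tiny as its backbone feature extraction network.
Compared with CSPdarknet53, the activation function is switched back to LeakyReLU for speed.
CSPdarknet53_tiny has two characteristics:
1. It uses the CSPnet structure.
The CSPnet structure is not complicated. It splits the original stack of residual blocks into two parts:
the main branch continues to stack the original residual blocks;
the other branch acts like a residual edge and is connected directly to the end after only a small amount of processing.
CSP can therefore be regarded as containing one large residual edge.
2. It splits the channels.
In the main branch of CSPnet, CSPdarknet53_tiny takes the feature layer produced by a 3x3 convolution, splits it into two halves along the channel dimension, and keeps the second half.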
#---------------------------------------------------#
#   Building block of CSPdarknet53-tiny
#   It contains one large residual edge
#   that bypasses several residual structures
#---------------------------------------------------#
class Resblock_body(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Resblock_body, self).__init__()
        self.conv1 = BasicConv(in_channels, out_channels, 3)
        self.conv2 = BasicConv(out_channels//2, out_channels//2, 3)
        self.conv3 = BasicConv(out_channels//2, out_channels//2, 3)
        self.conv4 = BasicConv(out_channels, out_channels, 1)
        self.maxpool = nn.MaxPool2d([2,2],[2,2])

    def forward(self, x):
        x = self.conv1(x)
        # route is the large residual edge
        route = x
        _, c, _, _ = x.size()
        # split the channels in two and keep the second half
        x = torch.split(x, c//2, dim=1)[1]
        x = self.conv2(x)
        route1 = x
        x = self.conv3(x)
        x = torch.cat([x,route1], dim = 1)
        x = self.conv4(x)
        # feat is the effective feature layer taken before pooling
        feat = x
        x = torch.cat([route, x], dim=1)
        x = self.maxpool(x)
        return x,feat
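The backbone gives us two effective feature layers of different shapes, namely the last two effective feature layers of CSPdarknet53_tiny. They are passed to the enhanced feature extraction network to build the FPN.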
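To make the two feature-layer shapes concrete, here is a minimal sketch that traces only the tensor shapes through the backbone in plain Python (no PyTorch needed). The function names are hypothetical helpers written for this illustration; the channel counts and strides are read off the CSPDarkNet and Resblock_body code in this post.

```python
def conv_out(size, kernel, stride):
    """Spatial size after a convolution padded with kernel // 2, as BasicConv does."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


def resblock_body_shapes(size, out_channels):
    """Return ((channels, size) after pooling, (channels, size) of feat)."""
    # conv1..conv4 all use stride 1, so the spatial size is unchanged until pooling.
    feat = (out_channels, size)              # output of conv4
    pooled = (2 * out_channels, size // 2)   # cat([route, x]) followed by 2x2 max pooling
    return pooled, feat


def backbone_shapes(input_size):
    """Trace CSPDarkNet.forward and return the (channels, size) of feat1 and feat2."""
    size = conv_out(input_size, 3, 2)        # conv1: 3 -> 32 channels, stride 2
    size = conv_out(size, 3, 2)              # conv2: 32 -> 64 channels, stride 2
    (channels, size), _ = resblock_body_shapes(size, 64)
    (channels, size), _ = resblock_body_shapes(size, 128)
    (channels, size), feat1 = resblock_body_shapes(size, 256)
    feat2 = (512, conv_out(size, 3, 1))      # conv3: 512 -> 512 channels, stride 1
    return feat1, feat2


print(backbone_shapes(416))
print(backbone_shapes(608))
```

For a 608x608 input this gives a 256-channel 38x38 layer and a 512-channel 19x19 layer, which are the two inputs the FPN works with.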
The full implementation is:
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
from collections import OrderedDict

#-------------------------------------------------#
#   Convolution block
#   CONV + BATCHNORM + LeakyReLU
#-------------------------------------------------#
class BasicConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super(BasicConv, self).__init__()
        # padding = kernel_size // 2 keeps the spatial size when stride is 1
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, kernel_size//2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.activation = nn.LeakyReLU(0.1)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.activation(x)
        return x
# Resblock_body is identical to the class listed above.
class CSPDarkNet(nn.Module):
    def __init__(self):
        super(CSPDarkNet, self).__init__()
        self.conv1 = BasicConv(3, 32, kernel_size=3, stride=2)
        self.conv2 = BasicConv(32, 64, kernel_size=3, stride=2)
        self.resblock_body1 = Resblock_body(64, 64)
        self.resblock_body2 = Resblock_body(128, 128)
        self.resblock_body3 = Resblock_body(256, 256)
        self.conv3 = BasicConv(512, 512, kernel_size=3)
        self.num_features = 1
        # initialize the weights
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x, _ = self.resblock_body1(x)
        x, _ = self.resblock_body2(x)
        # feat1 is the first effective feature layer
        x, feat1 = self.resblock_body3(x)
        x = self.conv3(x)
        # feat2 is the second effective feature layer
        feat2 = x
        return feat1,feat2

def darknet53_tiny(pretrained, **kwargs):
    model = CSPDarkNet()
    if pretrained:
        if isinstance(pretrained, str):
            model.load_state_dict(torch.load(pretrained))
        else:
            raise Exception("darknet request a pretrained path. got [{}]".format(pretrained))
    return model
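2. Feature pyramid
When the input is 416x416, the feature structure is as follows:
When the input is 608x608, the feature structure is as follows:
YoloV4-Tiny uses an FPN structure, mainly to fuse the two effective feature layers obtained in the first step.
The FPN applies a convolution to the last effective feature layer, upsamples the result, and then concatenates it with the previous effective feature layer before a further convolution.
The implementation is as follows: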
import torch
import torch.nn as nn
from collections import OrderedDict
from nets.CSPdarknet53_tiny import darknet53_tiny

#-------------------------------------------------#
#   Convolution block
#   CONV + BATCHNORM + LeakyReLU
#-------------------------------------------------#
class BasicConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super(BasicConv, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, kernel_size//2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.activation = nn.LeakyReLU(0.1)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.activation(x)
        return x

#---------------------------------------------------#
#   Convolution + upsampling
#---------------------------------------------------#
class Upsample(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Upsample, self).__init__()
        self.upsample = nn.Sequential(
            BasicConv(in_channels, out_channels, 1),
            nn.Upsample(scale_factor=2, mode='nearest')
        )

    def forward(self, x,):
        x = self.upsample(x)
        return x

#---------------------------------------------------#
#   Head that produces the final yolov4 output
#---------------------------------------------------#
def yolo_head(filters_list, in_filters):
    m = nn.Sequential(
        BasicConv(in_filters, filters_list[0], 3),
        nn.Conv2d(filters_list[0], filters_list[1], 1),
    )
    return m

#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, num_anchors, num_classes):
        super(YoloBody, self).__init__()
        # backbone
        self.backbone = darknet53_tiny(None)
        self.conv_for_P5 = BasicConv(512,256,1)
        self.yolo_headP5 = yolo_head([512, num_anchors * (5 + num_classes)],256)
        self.upsample = Upsample(256,128)
        # 384 = 256 channels of feat1 + 128 channels of the upsampled P5
        self.yolo_headP4 = yolo_head([256, num_anchors * (5 + num_classes)],384)

    def forward(self, x):
        # backbone
        feat1, feat2 = self.backbone(x)
        P5 = self.conv_for_P5(feat2)
        out0 = self.yolo_headP5(P5)
        P5_Upsample = self.upsample(P5)
        # concatenate along the channel dimension
        P4 = torch.cat([feat1,P5_Upsample],dim=1)
        out1 = self.yolo_headP4(P4)
        return out0, out1
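3. YoloHead: predicting from the extracted features
When the input is 416x416, the feature structure is as follows:
When the input is 608x608, the feature structure is as follows:
1. For feature utilization, YoloV4-Tiny extracts multiple feature layers for object detection. Two feature layers are extracted, with shapes (38,38,256) and (19,19,512).
2. The output layers have shapes (19,19,75) and (38,38,75). The last dimension is 75 because the figure is based on the VOC dataset, which has 20 classes, and YoloV4-Tiny has only 3 prior boxes for each feature layer, so the last dimension is 3x25.
With the COCO training set there are 80 classes, so the last dimension should be 255 = 3x85 and the two output layers have shapes (19,19,255) and (38,38,255).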
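The arithmetic behind 75 and 255 can be checked with a few lines of plain Python. This is only a sketch of the bookkeeping; head_channels and head_output_shapes are hypothetical helper names, and the strides of 32 and 16 are inferred from the 19x19 and 38x38 grids for a 608x608 input.

```python
def head_channels(num_anchors_per_layer, num_classes):
    """Channels of one YOLO head: 4 box terms + 1 confidence + class scores, per anchor."""
    return num_anchors_per_layer * (5 + num_classes)


def head_output_shapes(input_size, num_classes, num_anchors_per_layer=3):
    """(height, width, channels) of the stride-32 and stride-16 outputs."""
    channels = head_channels(num_anchors_per_layer, num_classes)
    return [(input_size // stride, input_size // stride, channels) for stride in (32, 16)]


print(head_channels(3, 20))          # VOC
print(head_channels(3, 80))          # COCO
print(head_output_shapes(608, 80))
```

The same formula appears in YoloBody as num_anchors * (5 + num_classes), where num_anchors is the number of prior boxes per feature layer.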
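4. Decoding the prediction results
Step 3 gives us the prediction results of two feature layers, with shapes (N,19,19,255) and (N,38,38,255). They correspond to the positions of 3 predicted boxes at every cell of the 19x19 and 38x38 grids into which each image is divided. (In the PyTorch code the tensors are laid out channels-first, that is, (N,255,19,19) and (N,255,38,38).)
However, these predictions do not directly correspond to the final box positions on the image; they must be decoded first.
Here is how yolo predicts: the feature layers divide the whole image into 19x19 and 38x38 grids, and each grid point is responsible for detecting one region.
Since the prediction of a feature layer corresponds to the positions of three predicted boxes, we first reshape it to (N,19,19,3,85) and (N,38,38,3,85).
The 85 in the last dimension consists of 4+1+80: x_offset, y_offset, w and h, the confidence, and the classification results.
Decoding adds the corresponding x_offset and y_offset to each grid point, which gives the center of the predicted box, and then combines the prior box with w and h to compute the width and height of the predicted box. That yields the position of the whole predicted box.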
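As a worked example of this decoding rule, the sketch below decodes a single raw prediction in plain Python. It is an illustration rather than the project's code: decode_box is a hypothetical helper, the anchor size is given in pixels, and stride is assumed to be the input size divided by the feature-layer size.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def decode_box(raw, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Decode one raw prediction (tx, ty, tw, th) into a pixel-space box.

    Returns (center_x, center_y, width, height) in pixels.
    """
    tx, ty, tw, th = raw
    # Center: sigmoid offset inside the cell plus the cell's top-left corner.
    center_x = (sigmoid(tx) + grid_x) * stride
    center_y = (sigmoid(ty) + grid_y) * stride
    # Size: the prior box scaled by exp of the raw width and height terms.
    width = math.exp(tw) * anchor_w
    height = math.exp(th) * anchor_h
    return center_x, center_y, width, height


# A raw prediction of all zeros lands in the middle of cell (9, 9)
# and keeps the prior box size unchanged.
print(decode_box((0.0, 0.0, 0.0, 0.0), 9, 9, 81, 82, stride=32))
```

The sigmoid keeps each predicted center inside its own grid cell, which is why every cell is responsible for exactly one region of the image.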
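Of course, once the final predictions are obtained, they still have to be sorted by score and filtered with non-maximum suppression.
This part is common to essentially all object detectors. This project handles it a little differently from some others, however: it evaluates each class separately.
1. For each class, take the boxes and scores whose score exceeds the confidence threshold (conf_thres in the code).
2. Apply non-maximum suppression using the box positions and scores.
The implementation is as follows: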
class DecodeBox(nn.Module):
    def __init__(self, anchors, num_classes, img_size):
        super(DecodeBox, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size

    def forward(self, input):
        # input is bs,3*(1+4+num_classes),13,13
        # number of images in the batch
        batch_size = input.size(0)
        # 13,13
        input_height = input.size(2)
        input_width = input.size(3)

        # compute the stride:
        # how many pixels of the original image one feature point covers
        # for a 13x13 feature layer, one feature point covers 32 pixels
        # 416/13 = 32
        stride_h = self.img_size[1] / input_height
        stride_w = self.img_size[0] / input_width

        # rescale the prior boxes to feature-layer units
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in self.anchors]

        # bs,3*(5+num_classes),13,13 -> bs,3,13,13,(5+num_classes)
        prediction = input.view(batch_size, self.num_anchors,
                                self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        # adjustment parameters for the prior box center
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        # adjustment parameters for the prior box width and height
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height

        # objectness confidence
        conf = torch.sigmoid(prediction[..., 4])
        # class confidence
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        # build the grid (top-left corner of each cell), shape batch_size,3,13,13
        # rows are repeated input_height times so non-square feature layers also work
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
            batch_size * self.num_anchors, 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
            batch_size * self.num_anchors, 1, 1).view(y.shape).type(FloatTensor)

        # build the prior box widths and heights
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

        # compute the adjusted centers, widths and heights
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data + grid_x
        pred_boxes[..., 1] = y.data + grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

        # rescale the output to the size of the network input (e.g. 416x416)
        _scale = torch.Tensor([stride_w, stride_h] * 2).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) * _scale,
                            conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
        return output.data
def bbox_iou(box1, box2, x1y1x2y2=True):
    """
    Compute the IoU
    """
    if not x1y1x2y2:
        # convert (center_x, center_y, w, h) to corner coordinates
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)

    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * \
                 torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)

    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)
    return iou
def non_max_suppression(prediction, num_classes, conf_thres=0.5, nms_thres=0.4):
    # convert to top-left and bottom-right corners
    box_corner = prediction.new(prediction.shape)
    box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
    box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
    box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
    box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
    prediction[:, :, :4] = box_corner[:, :, :4]

    output = [None for _ in range(len(prediction))]
    for image_i, image_pred in enumerate(prediction):
        # first round of filtering by confidence
        conf_mask = (image_pred[:, 4] >= conf_thres).squeeze()
        image_pred = image_pred[conf_mask]

        if not image_pred.size(0):
            continue

        # get the class and its confidence
        class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1, keepdim=True)

        # the content is (x1, y1, x2, y2, obj_conf, class_conf, class_pred)
        detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)

        # get the classes that are present
        unique_labels = detections[:, -1].cpu().unique()

        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()

        for c in unique_labels:
            # all predictions of this class that survived the first filter
            detections_class = detections[detections[:, -1] == c]
            # sort by objectness confidence
            _, conf_sort_index = torch.sort(detections_class[:, 4], descending=True)
            detections_class = detections_class[conf_sort_index]
            # non-maximum suppression
            max_detections = []
            while detections_class.size(0):
                # take the most confident box of this class, then drop every
                # remaining box whose overlap with it is at least nms_thres
                max_detections.append(detections_class[0].unsqueeze(0))
                if len(detections_class) == 1:
                    break
                ious = bbox_iou(max_detections[-1], detections_class[1:])
                detections_class = detections_class[1:][ious < nms_thres]
            # stack
            max_detections = torch.cat(max_detections).data
            # Add max detections to outputs
            output[image_i] = max_detections if output[image_i] is None else torch.cat(
                (output[image_i], max_detections))

    return output
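The per-class filtering described in this section can also be followed on a handful of boxes without tensors. The sketch below is a simplified plain-Python version, not the project's function: per_class_nms is a hypothetical name, detections are plain tuples, and the IoU here omits the +1 pixel convention used in bbox_iou.

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-16)


def per_class_nms(detections, conf_thres=0.5, nms_thres=0.4):
    """detections: list of (x1, y1, x2, y2, score, class_id). Returns the kept ones."""
    kept = []
    # Step 1: drop low-confidence boxes.
    candidates = [d for d in detections if d[4] >= conf_thres]
    # Step 2: run greedy suppression separately for every class.
    for class_id in sorted({d[5] for d in candidates}):
        remaining = sorted((d for d in candidates if d[5] == class_id),
                           key=lambda d: d[4], reverse=True)
        while remaining:
            best = remaining.pop(0)
            kept.append(best)
            remaining = [d for d in remaining if iou(best[:4], d[:4]) < nms_thres]
    return kept


boxes = [
    (0, 0, 10, 10, 0.9, 0),    # kept: best box of class 0
    (1, 1, 11, 11, 0.8, 0),    # suppressed: overlaps the box above
    (50, 50, 60, 60, 0.7, 0),  # kept: no overlap
    (0, 0, 10, 10, 0.6, 1),    # kept: different class, so not compared with class 0
    (0, 0, 10, 10, 0.3, 1),    # dropped by the confidence threshold
]
print(per_class_nms(boxes))
```

Because suppression runs per class, two boxes at the same location survive as long as they belong to different classes.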
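5. Drawing on the original image
After step 4 we have the positions of the predicted boxes on the original image, and these boxes have already been filtered. They can be drawn directly on the image to obtain the result.
Training YoloV4-Tiny
1. Improved training tricks from YOLOV4
a) Mosaic data augmentation
Mosaic data augmentation in Yolov4 is modeled on CutMix data augmentation, and the two are similar in principle.
CutMix stitches two images together.
Mosaic uses four images instead. According to the paper, its big advantage is that it enriches the backgrounds of the detected objects, and batch normalization computes statistics over the data of four images at once.
As in the figure below:
The implementation idea is:
1. Read four images each time.
2. Flip, scale and color-jitter each of the four images, then place them in the four corner positions.
3. Combine the images and combine the boxes.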
# rand(a, b) and merge_bboxes are helpers defined elsewhere in the project;
# rgb_to_hsv and hsv_to_rgb are assumed to come from matplotlib.colors,
# Image from PIL, and np is numpy.
def get_random_data_with_Mosaic(self, annotation_line, input_shape, hue=.1, sat=1.5, val=1.5):
    '''random preprocessing for real-time data augmentation'''
    h, w = input_shape
    min_offset_x = 0.4
    min_offset_y = 0.4
    scale_low = 1-min(min_offset_x,min_offset_y)
    scale_high = scale_low+0.2

    image_datas = []
    box_datas = []
    index = 0

    # top-left paste position of each of the four images
    place_x = [0,0,int(w*min_offset_x),int(w*min_offset_x)]
    place_y = [0,int(h*min_offset_y),int(h*min_offset_y),0]
    for line in annotation_line:
        # split each line
        line_content = line.split()
        # open the image
        image = Image.open(line_content[0])
        image = image.convert("RGB")
        # image size
        iw, ih = image.size
        # box positions
        box = np.array([np.array(list(map(int,box.split(',')))) for box in line_content[1:]])

        # randomly flip the image
        flip = rand()<.5
        if flip and len(box)>0:
            image = image.transpose(Image.FLIP_LEFT_RIGHT)
            box[:, [0,2]] = iw - box[:, [2,0]]

        # rescale the input image
        new_ar = w/h
        scale = rand(scale_low, scale_high)
        if new_ar < 1:
            nh = int(scale*h)
            nw = int(nh*new_ar)
        else:
            nw = int(scale*w)
            nh = int(nw/new_ar)
        image = image.resize((nw,nh), Image.BICUBIC)

        # color jitter; separate names keep the hue/sat/val ranges intact
        # for the remaining images in the loop
        hue_shift = rand(-hue, hue)
        sat_scale = rand(1, sat) if rand()<.5 else 1/rand(1, sat)
        val_scale = rand(1, val) if rand()<.5 else 1/rand(1, val)
        x = rgb_to_hsv(np.array(image)/255.)
        x[..., 0] += hue_shift
        x[..., 0][x[..., 0]>1] -= 1
        x[..., 0][x[..., 0]<0] += 1
        x[..., 1] *= sat_scale
        x[..., 2] *= val_scale
        x[x>1] = 1
        x[x<0] = 0
        image = hsv_to_rgb(x)

        image = Image.fromarray((image*255).astype(np.uint8))
        # paste the image at the position of its quadrant
        dx = place_x[index]
        dy = place_y[index]
        new_image = Image.new('RGB', (w,h), (128,128,128))
        new_image.paste(image, (dx, dy))
        image_data = np.array(new_image)

        index = index + 1
        box_data = []
        # adjust the boxes to the resized and shifted image
        if len(box)>0:
            np.random.shuffle(box)
            box[:, [0,2]] = box[:, [0,2]]*nw/iw + dx
            box[:, [1,3]] = box[:, [1,3]]*nh/ih + dy
            box[:, 0:2][box[:, 0:2]<0] = 0
            box[:, 2][box[:, 2]>w] = w
            box[:, 3][box[:, 3]>h] = h
            box_w = box[:, 2] - box[:, 0]
            box_h = box[:, 3] - box[:, 1]
            # discard boxes that became degenerate
            box = box[np.logical_and(box_w>1, box_h>1)]
            box_data = np.zeros((len(box),5))
            box_data[:len(box)] = box

        image_datas.append(image_data)
        box_datas.append(box_data)

    # cut the four images and put the pieces together
    cutx = np.random.randint(int(w*min_offset_x), int(w*(1 - min_offset_x)))
    cuty = np.random.randint(int(h*min_offset_y), int(h*(1 - min_offset_y)))

    new_image = np.zeros([h,w,3])
    new_image[:cuty, :cutx, :] = image_datas[0][:cuty, :cutx, :]
    new_image[cuty:, :cutx, :] = image_datas[1][cuty:, :cutx, :]
    new_image[cuty:, cutx:, :] = image_datas[2][cuty:, cutx:, :]
    new_image[:cuty, cutx:, :] = image_datas[3][:cuty, cutx:, :]

    # further processing of the boxes
    new_boxes = np.array(merge_bboxes(box_datas, cutx, cuty))

    if len(new_boxes) == 0:
        return new_image, []
    if (new_boxes[:,:4]>0).any():
        return new_image, new_boxes
    else:
        return new_image, []
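The mosaic code calls merge_bboxes, which is not listed in this post. The sketch below shows one plausible way to do the "combine the boxes" step in plain Python: each image's boxes are clipped to the quadrant that image occupies after the cut. All names here (clip_box_to_region, mosaic_regions, merge_boxes_sketch) and the min_size threshold are assumptions made for illustration, not the project's actual implementation.

```python
def clip_box_to_region(box, region, min_size=5):
    """Clip (x1, y1, x2, y2, class_id) to region (rx1, ry1, rx2, ry2).

    Returns the clipped box, or None when too little of the box survives.
    min_size is an assumed threshold in pixels.
    """
    x1, y1, x2, y2, class_id = box
    rx1, ry1, rx2, ry2 = region
    x1, y1 = max(x1, rx1), max(y1, ry1)
    x2, y2 = min(x2, rx2), min(y2, ry2)
    if x2 - x1 < min_size or y2 - y1 < min_size:
        return None
    return (x1, y1, x2, y2, class_id)


def mosaic_regions(w, h, cutx, cuty):
    """Visible region of each of the four images, in the order they are pasted."""
    return [
        (0, 0, cutx, cuty),      # image 0: top-left
        (0, cuty, cutx, h),      # image 1: bottom-left
        (cutx, cuty, w, h),      # image 2: bottom-right
        (cutx, 0, w, cuty),      # image 3: top-right
    ]


def merge_boxes_sketch(box_datas, w, h, cutx, cuty):
    """Keep, for every image, only the part of each box inside its quadrant."""
    merged = []
    for boxes, region in zip(box_datas, mosaic_regions(w, h, cutx, cuty)):
        for box in boxes:
            clipped = clip_box_to_region(box, region)
            if clipped is not None:
                merged.append(clipped)
    return merged


example = [
    [(150, 150, 250, 250, 0)],   # straddles the cut, gets clipped
    [(10, 10, 50, 50, 1)],       # lies outside the bottom-left quadrant, dropped
    [(300, 300, 400, 400, 2)],   # fully inside the bottom-right quadrant
    [(198, 10, 260, 60, 3)],     # clipped on its left edge
]
print(merge_boxes_sketch(example, 416, 416, cutx=200, cuty=200))
```

The quadrant order matches the four slice assignments that build new_image from image_datas[0] to image_datas[3].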
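b) Label smoothing
The idea of label smoothing is simple. The formula is:
new_onehot_labels = onehot_labels * (1 - label_smoothing) + label_smoothing / num_classes
When label_smoothing is 0.01, the formula becomes:
new_onehot_labels = y * (1 - 0.01) + 0.01 / num_classes
Label smoothing simply softens the labels. The original labels are 0 and 1; after smoothing they become 0.005 and 0.995 (for two classes). In other words, it slightly penalizes overconfident classification so that the model cannot classify too precisely, since being too precise tends to cause overfitting.
The implementation is as follows: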
#---------------------------------------------------#
#   Label smoothing
#---------------------------------------------------#
def smooth_labels(y_true, label_smoothing,num_classes):
    return y_true * (1.0 - label_smoothing) + label_smoothing / num_classes
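A quick numeric check of the formula, written for plain Python lists so it runs without PyTorch. smooth_labels_list is a hypothetical stand-in for the tensor version above.

```python
def smooth_labels_list(y_true, label_smoothing, num_classes):
    """Same formula as smooth_labels, applied element by element to a list."""
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes for y in y_true]


# Two classes with label_smoothing = 0.01: 0 becomes 0.005 and 1 becomes 0.995.
smoothed = smooth_labels_list([0.0, 1.0], label_smoothing=0.01, num_classes=2)
print(smoothed)

# A 20-class one-hot vector still sums to 1 after smoothing.
one_hot = [0.0] * 20
one_hot[7] = 1.0
print(sum(smooth_labels_list(one_hot, 0.01, 20)))
```

The smoothed vector remains a valid distribution because the mass removed from the true class is spread evenly over all classes.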
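c) CIOU
IoU is a ratio, so it is insensitive to the scale of the target object. However, the commonly used bounding-box regression losses are not fully equivalent to optimizing IoU, and plain IoU cannot directly optimize boxes that do not overlap.
This led to the idea of using IoU directly as the regression loss, and CIOU is one of the best variants.
CIOU takes the distance between the target and the anchor, the overlap ratio, the scale and a penalty term into account, which makes box regression more stable and avoids problems such as divergence during training that can occur with IoU and GIoU. The penalty factor accounts for how well the aspect ratio of the predicted box fits that of the target box.
The CIOU formula is:
$$CIOU = IOU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v$$
Here $\rho(b, b^{gt})$ is the Euclidean distance between the center points of the predicted box and the ground-truth box, and $c$ is the diagonal length of the smallest enclosing region that contains both the predicted box and the ground-truth box.
The formulas for $\alpha$ and $v$ are:
$$\alpha = \frac{v}{1 - IOU + v}$$
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
The corresponding loss is then 1 - CIOU.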
def box_ciou(b1, b2):
    """
    Inputs:
    ----------
    b1: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh
    b2: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh

    Returns:
    -------
    ciou: tensor, shape=(batch, feat_w, feat_h, anchor_num, 1)
    """
    # top-left and bottom-right corners of the predicted boxes
    b1_xy = b1[..., :2]
    b1_wh = b1[..., 2:4]
    b1_wh_half = b1_wh/2.
    b1_mins = b1_xy - b1_wh_half
    b1_maxes = b1_xy + b1_wh_half
    # top-left and bottom-right corners of the ground-truth boxes
    b2_xy = b2[..., :2]
    b2_wh = b2[..., 2:4]
    b2_wh_half = b2_wh/2.
    b2_mins = b2_xy - b2_wh_half
    b2_maxes = b2_xy + b2_wh_half

    # IoU between ground-truth and predicted boxes
    intersect_mins = torch.max(b1_mins, b2_mins)
    intersect_maxes = torch.min(b1_maxes, b2_maxes)
    intersect_wh = torch.max(intersect_maxes - intersect_mins, torch.zeros_like(intersect_maxes))
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    b1_area = b1_wh[..., 0] * b1_wh[..., 1]
    b2_area = b2_wh[..., 0] * b2_wh[..., 1]
    union_area = b1_area + b2_area - intersect_area
    iou = intersect_area / torch.clamp(union_area,min = 1e-6)

    # squared distance between the centers
    center_distance = torch.sum(torch.pow((b1_xy - b2_xy), 2), dim=-1)

    # corners of the smallest box enclosing both boxes
    enclose_mins = torch.min(b1_mins, b2_mins)
    enclose_maxes = torch.max(b1_maxes, b2_maxes)
    enclose_wh = torch.max(enclose_maxes - enclose_mins, torch.zeros_like(intersect_maxes))
    # squared diagonal of the enclosing box
    enclose_diagonal = torch.sum(torch.pow(enclose_wh,2), dim=-1)
    ciou = iou - 1.0 * (center_distance) / torch.clamp(enclose_diagonal,min = 1e-6)

    # aspect-ratio penalty
    v = (4 / (math.pi ** 2)) * torch.pow((torch.atan(b1_wh[..., 0]/torch.clamp(b1_wh[..., 1],min = 1e-6)) - torch.atan(b2_wh[..., 0]/torch.clamp(b2_wh[..., 1],min = 1e-6))), 2)
    alpha = v / torch.clamp((1.0 - iou + v),min=1e-6)
    ciou = ciou - alpha * v
    return ciou
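To see what the three CIOU terms contribute, the following sketch evaluates the same quantities for a single pair of boxes in plain Python. ciou_xywh is a hypothetical scalar counterpart written for this illustration; it mirrors the tensor code term by term but is not part of the project.

```python
import math


def ciou_xywh(box1, box2):
    """CIoU of two boxes given as (center_x, center_y, width, height)."""
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    # Corner coordinates of both boxes.
    b1 = (x1 - w1 / 2, y1 - h1 / 2, x1 + w1 / 2, y1 + h1 / 2)
    b2 = (x2 - w2 / 2, y2 - h2 / 2, x2 + w2 / 2, y2 + h2 / 2)
    # Plain IoU.
    inter_w = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    inter_h = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / max(union, 1e-6)
    # Squared center distance over the squared diagonal of the enclosing box.
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    c2 = ((max(b1[2], b2[2]) - min(b1[0], b2[0])) ** 2
          + (max(b1[3], b2[3]) - min(b1[1], b2[1])) ** 2)
    # Aspect-ratio penalty.
    v = (4 / math.pi ** 2) * (math.atan(w1 / max(h1, 1e-6)) - math.atan(w2 / max(h2, 1e-6))) ** 2
    alpha = v / max(1.0 - iou + v, 1e-6)
    return iou - rho2 / max(c2, 1e-6) - alpha * v


print(ciou_xywh((5, 5, 4, 4), (5, 5, 4, 4)))    # identical boxes
print(ciou_xywh((0, 0, 2, 2), (10, 0, 2, 2)))   # same shape, no overlap
```

For non-overlapping boxes the IoU is 0, yet the distance term still produces a negative value that shrinks as the boxes approach each other, which is what gives the loss a useful gradient in that case.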
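d) Cosine annealing learning-rate decay
With cosine annealing decay, the learning rate first rises and then falls, which is the idea behind annealing optimization. (Look up simulated annealing if you are unfamiliar with it.)
The rise is linear and the fall follows a cosine curve. This can be repeated several times.
The effect is shown in the figure:
Cosine annealing with warmup has a few essential parameters:
1. learning_rate_base: the peak learning rate.
2. warmup_learning_rate: the initial learning rate.
3. warmup_steps: the number of steps needed to reach the peak.
In the PyTorch version the decay uses the built-in scheduler, which is attached to the optimizer and used in much the same way as ReduceLROnPlateau. Note that CosineAnnealingLR implements only the cosine part: it takes T_max and eta_min and has no warmup parameters.
lr_scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5, eta_min=1e-5)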
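The three warmup parameters listed above describe a schedule that can be written as a single function of the step index. The sketch below is a plain-Python illustration of that schedule, not what CosineAnnealingLR computes; warmup_cosine_lr, total_steps and min_lr are names and parameters assumed for this example.

```python
import math


def warmup_cosine_lr(step, total_steps, learning_rate_base,
                     warmup_learning_rate, warmup_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear rise from warmup_learning_rate to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Cosine fall from learning_rate_base down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (learning_rate_base - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [warmup_cosine_lr(s, 100, 1e-3, 1e-4, 10) for s in range(101)]
print(schedule[0], schedule[10], schedule[100])
```

Running several of these cycles back to back gives the repeated rise-and-fall curve described above.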
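2. Components of the loss
a) Parameters needed to compute the loss
Computing the loss is really a comparison between y_pre and y_true:
y_pre is the network output for an image and contains the contents of two feature layers; it must be decoded before it can be drawn on the image.
y_true describes a real image: for each ground-truth box, the offset, width and height, and class on the corresponding (19,19) and (38,38) grids. It must be encoded so that its structure matches y_pre.
In fact y_pre and y_true both have the shapes
(batch_size,19,19,3,85)
(batch_size,38,38,3,85)
b) What is y_pre
The network's final output is, for each grid point of the two feature layers, the predicted boxes and their classes. That is, the two feature layers correspond to the image divided into grids of different sizes, and each grid point holds the position, confidence and class for its three prior boxes.
For the outputs y1 and y2, [..., :2] is the offset relative to each grid point, [..., 2:4] is the width and height, [..., 4:5] is the confidence of the box, and [..., 5:] is the predicted probability of each class.
At this point y_pre is still undecoded; only after decoding does it describe the real image.
c) What is y_true
As stated above, y_true holds, for each ground-truth box of a real image, the offset, width and height, and class on the corresponding (19,19) and (38,38) grids. It must be encoded so that its structure matches y_pre.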
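The encoding of y_true boils down to two decisions per ground-truth box: which grid cell it falls into and which prior box fits it best. The sketch below illustrates both in plain Python under simplifying assumptions: encode_target and wh_iou are hypothetical helpers, boxes are normalized (cx, cy, w, h), anchors are already in grid units, and the per-layer anchor selection done in the project code is left out.

```python
def wh_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same top-left corner."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def encode_target(box, anchors, grid_w, grid_h):
    """Map a normalized (cx, cy, w, h) box to its grid cell and best anchor.

    Returns (grid_i, grid_j, best_anchor_index).
    """
    cx, cy, w, h = box
    # Scale the normalized box to grid units.
    gx, gy = cx * grid_w, cy * grid_h
    gw, gh = w * grid_w, h * grid_h
    # The anchor whose shape overlaps the box most is responsible for it.
    ious = [wh_iou(gw, gh, aw, ah) for aw, ah in anchors]
    best = max(range(len(anchors)), key=lambda k: ious[k])
    # The integer part of the center selects the cell.
    return int(gx), int(gy), best


anchors = [(1, 1), (3, 3), (6, 6)]
print(encode_target((0.5, 0.5, 0.25, 0.25), anchors, grid_w=13, grid_h=13))
```

Only the selected cell and anchor are marked as containing an object; every other entry of y_true for that box stays zero.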
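d) How the loss is computed
Once we have y_pre and y_true, how do we compare them? Not by a simple subtraction!
The loss has to be processed for both feature layers; here we take the smallest feature layer as an example.
1. Use y_true to extract the positions where a target really exists in this feature layer, (m,19,19,3,1), and the corresponding classes, (m,19,19,3,80).
2. Process the raw prediction output to get the reshaped prediction y_pre with shape (m,19,19,3,85), together with the decoded xy and wh.
3. For each image, compute the IoU between all ground-truth boxes and the predicted boxes. Predicted boxes whose overlap with a ground-truth box exceeds 0.5 are ignored.
4. Compute CIOU as the regression loss; only positive samples contribute to it.
5. Compute the confidence loss, which has two parts. The first part covers boxes where a target really exists: the predicted confidence is compared with 1. The second part covers boxes where no target exists, excluding the boxes ignored in step 3: the predicted confidence is compared with 0.
6. Compute the class prediction loss, which measures the gap between the predicted class and the true class for boxes where a target really exists.
The total loss is the sum of three losses:
- For boxes that really exist, the CIOU loss.
- For boxes that really exist, the predicted confidence compared with 1; for boxes that do not exist, the predicted confidence compared with 0, after removing the ignored boxes that contain no target.
- For boxes that really exist, the class prediction compared with the true class.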
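The confidence term is the least obvious of the three, because ignored boxes must contribute nothing. The sketch below reproduces that masking on a flat list of three boxes in plain Python; bce and confidence_loss are hypothetical scalar helpers written for this illustration, with the same clipping epsilon as the BCELoss function in the code.

```python
import math


def bce(pred, target, epsilon=1e-7):
    """Binary cross-entropy for one probability, clipped to avoid log(0)."""
    pred = min(max(pred, epsilon), 1.0 - epsilon)
    return -target * math.log(pred) - (1.0 - target) * math.log(1.0 - pred)


def confidence_loss(conf, obj_mask, noobj_mask, batch_size=1):
    """conf, obj_mask and noobj_mask are flat lists of equal length."""
    total = 0.0
    for p, obj, noobj in zip(conf, obj_mask, noobj_mask):
        total += bce(p, obj) * obj      # positives are pushed toward 1
        total += bce(p, obj) * noobj    # non-ignored negatives are pushed toward 0
    return total / batch_size


conf = [0.9, 0.2, 0.7]
obj_mask = [1, 0, 0]     # the first box contains a target
noobj_mask = [0, 1, 0]   # the third box is ignored: both masks are 0
print(confidence_loss(conf, obj_mask, noobj_mask))
```

Whatever confidence the third box predicts, the result does not change, which is exactly the effect of the ignore threshold.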
The actual code is as follows:
import math
import numpy as np
import torch
import torch.nn as nn

def jaccard(_box_a, _box_b):
    # convert (center_x, center_y, w, h) to corner coordinates
    b1_x1, b1_x2 = _box_a[:, 0] - _box_a[:, 2] / 2, _box_a[:, 0] + _box_a[:, 2] / 2
    b1_y1, b1_y2 = _box_a[:, 1] - _box_a[:, 3] / 2, _box_a[:, 1] + _box_a[:, 3] / 2
    b2_x1, b2_x2 = _box_b[:, 0] - _box_b[:, 2] / 2, _box_b[:, 0] + _box_b[:, 2] / 2
    b2_y1, b2_y2 = _box_b[:, 1] - _box_b[:, 3] / 2, _box_b[:, 1] + _box_b[:, 3] / 2
    box_a = torch.zeros_like(_box_a)
    box_b = torch.zeros_like(_box_b)
    box_a[:, 0], box_a[:, 1], box_a[:, 2], box_a[:, 3] = b1_x1, b1_y1, b1_x2, b1_y2
    box_b[:, 0], box_b[:, 1], box_b[:, 2], box_b[:, 3] = b2_x1, b2_y1, b2_x2, b2_y2
    A = box_a.size(0)
    B = box_b.size(0)
    # pairwise intersection of every box in a with every box in b
    max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),
                       box_b[:, 2:].unsqueeze(0).expand(A, B, 2))
    min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),
                       box_b[:, :2].unsqueeze(0).expand(A, B, 2))
    inter = torch.clamp((max_xy - min_xy), min=0)

    inter = inter[:, :, 0] * inter[:, :, 1]
    # areas of the two sets of boxes
    area_a = ((box_a[:, 2]-box_a[:, 0]) *
              (box_a[:, 3]-box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]
    area_b = ((box_b[:, 2]-box_b[:, 0]) *
              (box_b[:, 3]-box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]
    # IoU
    union = area_a + area_b - inter
    return inter / union  # [A,B]
# smooth_labels and box_ciou are identical to the definitions listed in the Label Smoothing and CIOU sections above.
def clip_by_tensor(t,t_min,t_max):
    # clamp every element of t to the range [t_min, t_max]
    t=t.float()
    result = (t >= t_min).float() * t + (t < t_min).float() * t_min
    result = (result <= t_max).float() * result + (result > t_max).float() * t_max
    return result

def MSELoss(pred,target):
    return (pred-target)**2

def BCELoss(pred,target):
    # element-wise binary cross-entropy; pred is clipped to avoid log(0)
    epsilon = 1e-7
    pred = clip_by_tensor(pred, epsilon, 1.0 - epsilon)
    output = -target * torch.log(pred) - (1.0 - target) * torch.log(1.0 - pred)
    return output
class YOLOLoss(nn.Module):
    def __init__(self, anchors, num_classes, img_size, label_smooth=0, cuda=True):
        super(YOLOLoss, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size
        # sizes of the two feature layers (stride 32 and stride 16)
        self.feature_length = [img_size[0]//32,img_size[0]//16]
        self.label_smooth = label_smooth

        self.ignore_threshold = 0.5
        self.lambda_conf = 1.0
        self.lambda_cls = 1.0
        self.lambda_loc = 1.0
        self.cuda = cuda

    def forward(self, input, targets=None):
        # input is bs,3*(5+num_classes),13,13

        # number of images in the batch
        bs = input.size(0)
        # height of the feature layer
        in_h = input.size(2)
        # width of the feature layer
        in_w = input.size(3)

        # compute the stride:
        # how many pixels of the original image one feature point covers
        # for a 13x13 feature layer, one feature point covers 32 pixels
        stride_h = self.img_size[1] / in_h
        stride_w = self.img_size[0] / in_w

        # rescale the prior boxes to feature-layer units
        scaled_anchors = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in self.anchors]
        # bs,3*(5+num_classes),13,13 -> bs,3,13,13,(5+num_classes)
        prediction = input.view(bs, int(self.num_anchors/2),
                                self.bbox_attrs, in_h, in_w).permute(0, 1, 3, 4, 2).contiguous()

        # adjust the predictions
        conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        # find which prior boxes contain an object
        mask, noobj_mask, t_box, tconf, tcls, box_loss_scale_x, box_loss_scale_y = self.get_target(targets, scaled_anchors,in_w, in_h,self.ignore_threshold)

        noobj_mask, pred_boxes_for_ciou = self.get_ignore(prediction, targets, scaled_anchors, in_w, in_h, noobj_mask)

        if self.cuda:
            mask, noobj_mask = mask.cuda(), noobj_mask.cuda()
            box_loss_scale_x, box_loss_scale_y= box_loss_scale_x.cuda(), box_loss_scale_y.cuda()
            tconf, tcls = tconf.cuda(), tcls.cuda()
            pred_boxes_for_ciou = pred_boxes_for_ciou.cuda()
            t_box = t_box.cuda()

        # small boxes get a larger weight
        box_loss_scale = 2-box_loss_scale_x*box_loss_scale_y
        #  losses.
        ciou = (1 - box_ciou( pred_boxes_for_ciou[mask.bool()], t_box[mask.bool()]))* box_loss_scale[mask.bool()]

        loss_loc = torch.sum(ciou / bs)
        loss_conf = torch.sum(BCELoss(conf, mask) * mask / bs) + \
                    torch.sum(BCELoss(conf, mask) * noobj_mask / bs)

        # print(smooth_labels(tcls[mask == 1],self.label_smooth,self.num_classes))
        loss_cls = torch.sum(BCELoss(pred_cls[mask == 1], smooth_labels(tcls[mask == 1],self.label_smooth,self.num_classes))/bs)
        # print(loss_loc,loss_conf,loss_cls)
        loss = loss_conf * self.lambda_conf + loss_cls * self.lambda_cls + loss_loc * self.lambda_loc
        return loss, loss_conf.item(), loss_cls.item(), loss_loc.item()

    def get_target(self, target, anchors, in_w, in_h, ignore_threshold):
        # number of images in the batch
        bs = len(target)
        # prior boxes that belong to this feature layer
        anchor_index = [[3,4,5],[1,2,3]][self.feature_length.index(in_w)]
        # create arrays filled with zeros or ones
        mask = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        noobj_mask = torch.ones(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)

        tx = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        ty = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        tw = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        th = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        t_box = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, 4, requires_grad=False)
        tconf = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        tcls = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, self.num_classes, requires_grad=False)

        box_loss_scale_x = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        box_loss_scale_y = torch.zeros(bs, int(self.num_anchors/2), in_h, in_w, requires_grad=False)
        for b in range(bs):
            for t in range(target[b].shape[0]):
                # position on the feature layer
                gx = target[b][t, 0] * in_w
                gy = target[b][t, 1] * in_h

                gw = target[b][t, 2] * in_w
                gh = target[b][t, 3] * in_h

                # grid cell the box belongs to
                gi = int(gx)
                gj = int(gy)

                # ground-truth box placed at the origin
                gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)

                # all prior boxes placed at the origin
                anchor_shapes = torch.FloatTensor(np.concatenate((np.zeros((self.num_anchors, 2)),
                                                                  np.array(anchors)), 1))
                # overlap between the ground-truth box and every prior box
                anch_ious = bbox_iou(gt_box, anchor_shapes)

                # Find the best matching anchor box
                best_n = np.argmax(anch_ious)
                if best_n not in anchor_index:
                    continue
                # Masks
                if (gj < in_h) and (gi < in_w):
                    best_n = anchor_index.index(best_n)
                    # mark which prior boxes really contain an object
                    noobj_mask[b, best_n, gj, gi] = 0
                    mask[b, best_n, gj, gi] = 1
                    # center of the ground-truth box
                    tx[b, best_n, gj, gi] = gx
                    ty[b, best_n, gj, gi] = gy
                    # width and height of the ground-truth box
                    tw[b, best_n, gj, gi] = gw
                    th[b, best_n, gj, gi] = gh
                    # normalized width and height, used to weight the box loss
                    box_loss_scale_x[b, best_n, gj, gi] = target[b][t, 2]
                    box_loss_scale_y[b, best_n, gj, gi] = target[b][t, 3]
                    # objectness confidence
                    tconf[b, best_n, gj, gi] = 1
                    # class
                    tcls[b, best_n, gj, gi, int(target[b][t, 4])] = 1
                else:
                    print('Step {0} out of bound'.format(b))
                    print('gj: {0}, height: {1} | gi: {2}, width: {3}'.format(gj, in_h, gi, in_w))
                    continue
        t_box[...,0] = tx
        t_box[...,1] = ty
        t_box[...,2] = tw
        t_box[...,3] = th
        return mask, noobj_mask, t_box, tconf, tcls, box_loss_scale_x, box_loss_scale_y

    def get_ignore(self,prediction,target,scaled_anchors,in_w, in_h,noobj_mask):
        bs = len(target)
        anchor_index = [[3,4,5],[1,2,3]][self.feature_length.index(in_w)]
        scaled_anchors = np.array(scaled_anchors)[anchor_index]
        # adjustment parameters for the prior box center
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        # adjustment parameters for the prior box width and height
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        # build the grid (top-left corner of each cell)
        # rows are repeated in_h times so non-square feature layers also work
        grid_x = torch.linspace(0, in_w - 1, in_w).repeat(in_h, 1).repeat(
            int(bs*self.num_anchors/2), 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, in_h - 1, in_h).repeat(in_w, 1).t().repeat(
            int(bs*self.num_anchors/2), 1, 1).view(y.shape).type(FloatTensor)

        # build the prior box widths and heights
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))

        anchor_w = anchor_w.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(w.shape)
        anchor_h = anchor_h.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(h.shape)

        # compute the adjusted centers, widths and heights
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x + grid_x
        pred_boxes[..., 1] = y + grid_y
        pred_boxes[..., 2] = torch.exp(w) * anchor_w
        pred_boxes[..., 3] = torch.exp(h) * anchor_h
        for i in range(bs):
            pred_boxes_for_ignore = pred_boxes[i]
            pred_boxes_for_ignore = pred_boxes_for_ignore.view(-1, 4)
            if len(target[i]) > 0:
                gx = target[i][:, 0:1] * in_w
                gy = target[i][:, 1:2] * in_h
                gw = target[i][:, 2:3] * in_w
                gh = target[i][:, 3:4] * in_h
                gt_box = torch.FloatTensor(np.concatenate([gx, gy, gw, gh],-1)).type(FloatTensor)

                anch_ious = jaccard(gt_box, pred_boxes_for_ignore)
                for t in range(target[i].shape[0]):
                    anch_iou = anch_ious[t].view(pred_boxes[i].size()[:3])
                    # predictions that overlap a ground-truth box strongly are not treated as negatives
                    noobj_mask[i][anch_iou>self.ignore_threshold] = 0
        return noobj_mask, pred_boxes
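Training your own YOLOV4 model
The overall folder structure of yolo4 is as follows:
This post uses the VOC format for training.
Before training, put the label files in the Annotation folder under VOC2007 inside the VOCdevkit folder.
Before training, put the image files in the JPEGImages folder under VOC2007 inside the VOCdevkit folder.
Before training, use voc2yolo3.py to generate the corresponding txt files.
Then run voc_annotation.py in the root directory. Before running it, change classes to your own classes.
classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
This generates the corresponding 2007_train.txt, in which each line holds an image path and the positions of its ground-truth boxes.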
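The mosaic loader earlier in this post reads each line by splitting on whitespace and then splitting every box on commas, so a line has the form "image_path x1,y1,x2,y2,class_id ...". The sketch below parses one such line in plain Python; parse_annotation_line is a hypothetical helper, and the sample path and numbers are made up for illustration.

```python
def parse_annotation_line(line):
    """Split 'image_path x1,y1,x2,y2,class_id ...' into a path and a list of boxes."""
    parts = line.split()
    image_path = parts[0]
    # Every remaining token is one box: four corner coordinates and a class index.
    boxes = [tuple(int(value) for value in item.split(",")) for item in parts[1:]]
    return image_path, boxes


# Made-up sample line in the same layout as 2007_train.txt.
sample = "VOCdevkit/VOC2007/JPEGImages/000001.jpg 48,240,195,371,11 8,12,352,498,14"
path, boxes = parse_annotation_line(sample)
print(path)
print(boxes)
```

The class index is the position of the class name in the classes list, so 14 corresponds to "person" in the VOC list above.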
Before training, also edit voc_classes.txt in model_data and change the classes to your own classes.
Run train.py to start training.