Understanding the YOLO Loss in Depth, Part 1
This post is a code walkthrough and analysis of the YOLO loss.
The reference code is a PyTorch implementation of YOLOv3; the repository is linked with this post.
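1. Target
First, we need to decide what kind of target the dataset returns on each iteration.
This raises a tricky problem: every image contains a different number of objects, so how can the boxes be stored with a uniform shape and fed into the network for the forward pass?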
# Module-level imports used by this method
import os
import numpy as np
import torch
import torchvision.transforms as transforms
from PIL import Image

def __getitem__(self, index):
    # ---------
    #  Image
    # ---------
    # Path of one training image
    img_path = self.img_files[index % len(self.img_files)].rstrip()
    # Extract image as PyTorch tensor
    img = transforms.ToTensor()(Image.open(img_path).convert('RGB'))
    # Handle images with less than three channels
    # (expand a grayscale image to three channels)
    if len(img.shape) != 3:
        img = img.unsqueeze(0)
        # Unpack the spatial dims; expand() needs a flat sequence of ints
        img = img.expand((3, *img.shape[1:]))
    _, h, w = img.shape
    h_factor, w_factor = (h, w) if self.normalized_labels else (1, 1)
    # Pad to square resolution
    img, pad = pad_to_square(img, 0)
    _, padded_h, padded_w = img.shape
    # ---------
    #  Label
    # ---------
    label_path = self.label_files[index % len(self.img_files)].rstrip()
    targets = None
    if os.path.exists(label_path):
        boxes = torch.from_numpy(np.loadtxt(label_path).reshape(-1, 5))
        # Extract coordinates for unpadded + unscaled image
        x1 = w_factor * (boxes[:, 1] - boxes[:, 3] / 2)
        y1 = h_factor * (boxes[:, 2] - boxes[:, 4] / 2)
        x2 = w_factor * (boxes[:, 1] + boxes[:, 3] / 2)
        y2 = h_factor * (boxes[:, 2] + boxes[:, 4] / 2)
        # Adjust for added padding
        x1 += pad[0]
        y1 += pad[2]
        x2 += pad[1]
        y2 += pad[3]
        # Returns (x, y, w, h)
        boxes[:, 1] = ((x1 + x2) / 2) / padded_w
        boxes[:, 2] = ((y1 + y2) / 2) / padded_h
        boxes[:, 3] *= w_factor / padded_w
        boxes[:, 4] *= h_factor / padded_h
        # Column 0 is left empty for the image index
        targets = torch.zeros((len(boxes), 6))
        targets[:, 1:] = boxes
    # Apply augmentations
    if self.augment:
        if np.random.random() < 0.5:
            img, targets = horisontal_flip(img, targets)
    # Shape summary: img is a single square image,
    # targets is [num_boxes, 6]
    return img_path, img, targets
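To make the label arithmetic in `__getitem__` concrete, here is a minimal pure-Python sketch of the same remapping for a single box. It assumes that `pad` is ordered `(left, right, top, bottom)`, which the snippet itself does not show, and the helper name `remap_box` is hypothetical.

```python
def remap_box(cx, cy, bw, bh, w, h, pad):
    """Remap one normalized (cx, cy, bw, bh) box from a w x h image onto
    the padded square image. `pad` is assumed to be
    (left, right, top, bottom) in pixels."""
    padded_w = w + pad[0] + pad[1]
    padded_h = h + pad[2] + pad[3]
    # Box corners in pixels of the unpadded image
    x1 = w * (cx - bw / 2)
    y1 = h * (cy - bh / 2)
    x2 = w * (cx + bw / 2)
    y2 = h * (cy + bh / 2)
    # Shift by the padding, mirroring the indices used in __getitem__
    x1 += pad[0]
    y1 += pad[2]
    x2 += pad[1]
    y2 += pad[3]
    # Back to a normalized center and size on the padded image
    return (
        (x1 + x2) / 2 / padded_w,
        (y1 + y2) / 2 / padded_h,
        bw * w / padded_w,
        bh * h / padded_h,
    )


# A 640x480 image padded with 80 px on top and bottom becomes 640x640.
box = remap_box(0.5, 0.5, 0.2, 0.4, w=640, h=480, pad=(0, 0, 80, 80))
```

In this example the box stays centered and keeps its normalized width, while its normalized height shrinks from 0.4 to 0.3 because the padded image is taller than the original.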
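Reading through `__getitem__`, we can see the author's approach: all boxes are concatenated together, and an extra column is added at the front of each row to record which image the box belongs to. Because the dataset itself knows nothing about batches, that bookkeeping is done in `collate_fn`, which the DataLoader calls: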
def collate_fn(self, batch):
    # Resize the images to a common size and tag each label row with its image index
    paths, imgs, targets = list(zip(*batch))
    # Add sample index to targets: column 0 was left empty in __getitem__
    # and is filled here with the image's position in the batch.
    # Indexing before filtering keeps the index aligned with imgs.
    for i, boxes in enumerate(targets):
        if boxes is not None:
            boxes[:, 0] = i
    # Remove empty placeholder targets (images without labels)
    targets = [boxes for boxes in targets if boxes is not None]
    # Concatenate all boxes of the batch; column 0 of each row tells
    # which image the box belongs to
    targets = torch.cat(targets, 0)
    # Selects new image size every tenth batch (within the allowed range)
    if self.multiscale and self.batch_count % 10 == 0:
        self.img_size = random.choice(range(self.min_size, self.max_size + 1, 32))
    # Resize images to input shape
    imgs = torch.stack([resize(img, self.img_size) for img in imgs])
    self.batch_count += 1
    return paths, imgs, targets
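The bookkeeping done by `collate_fn` can be sketched without PyTorch, using plain lists in place of tensors. The helper name `collate_targets` and the sample values are hypothetical, and the sketch numbers each image by its position in the batch. The last line illustrates the multi-scale candidates for assumed bounds of 320 and 512.

```python
def collate_targets(per_image_boxes):
    """per_image_boxes holds one entry per image: either None or a list of
    [0, cls, x, y, w, h] rows. Returns a single flat list of rows whose
    first column is the image's position in the batch."""
    flat = []
    for i, boxes in enumerate(per_image_boxes):
        if boxes is None:  # image without a label file
            continue
        for row in boxes:
            flat.append([i] + list(row[1:]))
    return flat


batch = [
    [[0, 2, 0.5, 0.5, 0.2, 0.3], [0, 7, 0.1, 0.2, 0.05, 0.05]],  # image 0: two boxes
    None,                                                         # image 1: no labels
    [[0, 1, 0.8, 0.4, 0.3, 0.3]],                                 # image 2: one box
]
targets = collate_targets(batch)

# Multi-scale candidates: multiples of 32 between the assumed bounds
sizes = list(range(320, 512 + 1, 32))
```

However many objects each image contains, the result is always a two-dimensional structure of shape [total_boxes, 6], which is what lets a whole batch travel through the pipeline as one tensor.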
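2. Compose target
Following the many CSDN blog posts on the subject, YOLO is an anchor-based object detection algorithm, so the target must include an anchor dimension. Each box also needs x, y, w, h, conf and class_conf. Since YOLO can be viewed as an upgrade of the sliding-window approach, the grid (the number of cells) must be accounted for as well. From this, the target shape follows directly:
[batch_size, anchors, grid, grid, (x + y + w + h + conf + num_classes)]
So the last dimension has size 5 + num_classes.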
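As a quick check of this shape arithmetic, the sketch below computes the target shape for a few grids. The batch size, the 416x416 input, the 80 classes and the strides of 32, 16 and 8 are assumptions chosen to match a common YOLOv3 setup, and `yolo_target_shape` is a hypothetical helper.

```python
def yolo_target_shape(batch_size, num_anchors, grid, num_classes):
    # x, y, w, h and conf take 5 slots; the rest are per-class scores
    return (batch_size, num_anchors, grid, grid, 5 + num_classes)


# Assumed settings: batch of 8, 416x416 input, 3 anchors per scale, 80 classes
shapes = [yolo_target_shape(8, 3, 416 // stride, 80) for stride in (32, 16, 8)]
```

With these assumptions the last dimension is 85 at every scale, and only the grid size changes (13, 26 and 52).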
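In the source code mentioned above, the function that builds the targets lives in the utils file and is called from the model. The call below sits inside a single YOLO layer; the network should have three such layers, each assigned 3 anchors.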
iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
    pred_boxes=pred_boxes,
    pred_cls=pred_cls,
    target=targets,
    anchors=self.scaled_anchors,
    ignore_thres=self.ignore_thres,
)
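First, let us confirm the shapes of these inputs:
- target: [num_boxes, 6], where num_boxes is the total number of boxes in the batch and the 6 values are (image id of the box, class of the box, x, y, w, h)
- pred_boxes: [N, anchors, grid, grid, 4]
- pred_cls: [N, anchors, grid, grid, num_classes]
- anchors: [3, 2]; a scale factor is computed from the image size and the grid size and applied to the anchors, so they are expressed in grid units
- ignore_thres: a scalar threshold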
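The anchor scaling mentioned in the list can be illustrated as follows. This is a sketch under assumptions: the pixel anchors are the ones commonly used for YOLOv3's coarsest scale, the input is 416x416 with a 13x13 grid, and `scale_anchors` is a hypothetical helper rather than the repository's code.

```python
def scale_anchors(anchors, img_size, grid):
    """Convert anchors from pixels to grid-cell units."""
    stride = img_size / grid  # pixels covered by one grid cell
    return [(aw / stride, ah / stride) for aw, ah in anchors]


anchors_px = [(116, 90), (156, 198), (373, 326)]
scaled = scale_anchors(anchors_px, img_size=416, grid=13)
```

Here the stride is 32, so the first anchor of 116x90 pixels becomes 3.625x2.8125 grid cells, which is the unit the ground-truth boxes are converted to before matching.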
Stepping into the function, let us see what it actually does:
import torch

# bbox_wh_iou and bbox_iou are helpers from the reference repository
def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    # pred_boxes   [N, anchors, grid, grid, (x, y, w, h)]
    # pred_cls     [N, anchors, grid, grid, num_classes]
    # target       [num_boxes, 6]: img_id, cls, x, y, w, h
    # anchors      [3, 2]
    # ignore_thres scalar

    # Pick the tensor constructors that match the device of pred_boxes
    ByteTensor = torch.cuda.ByteTensor if pred_boxes.is_cuda else torch.ByteTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)  # number of samples
    nA = pred_boxes.size(1)  # number of anchors
    nC = pred_cls.size(-1)   # number of classes
    nG = pred_boxes.size(2)  # grid size

    # Output tensors
    obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)    # 1 where a cell holds an object
    noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)  # 1 where a cell holds no object
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)

    # Convert to position relative to box (normalized coords -> grid units)
    target_boxes = target[:, 2:6] * nG
    gxy = target_boxes[:, :2]
    gwh = target_boxes[:, 2:]
    # Get anchors with best iou: IoU of every box against every anchor
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])
    # For each box, the highest IoU and the index of the anchor that achieves it
    best_ious, best_n = ious.max(0)
    # Separate target values
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()         # x, y of all boxes
    gw, gh = gwh.t()         # w, h of all boxes
    gi, gj = gxy.long().t()  # x, y rounded down: the cell that owns each box
    # Set masks at the cell and best anchor of each box
    obj_mask[b, best_n, gj, gi] = 1
    noobj_mask[b, best_n, gj, gi] = 0

    # Set noobj mask to zero where iou exceeds ignore threshold
    for i, anchor_ious in enumerate(ious.t()):  # [3, num_boxes] => [num_boxes, 3]
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # Coordinates: offset of the box center inside its cell
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height: log ratio to the matched anchor
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # One-hot encoding of label
    tcls[b, best_n, gj, gi, target_labels] = 1
    # Compute label correctness and iou at best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)

    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
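The anchor-matching step inside `build_targets` can be illustrated for a single ground-truth box. The sketch below is a pure-Python approximation: `wh_iou` mimics what a width-height IoU is assumed to compute (two boxes sharing the same center), `match_anchor` is a hypothetical helper, and the anchor values and the threshold of 0.5 are example inputs.

```python
def wh_iou(wh1, wh2):
    """IoU of two boxes that share the same center, given only (w, h)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    union = wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter
    return inter / union


def match_anchor(gwh, anchors, ignore_thres):
    """Return the best anchor index, all IoUs, and the anchors whose
    IoU exceeds ignore_thres (these are cleared from the no-object mask)."""
    ious = [wh_iou(anchor, gwh) for anchor in anchors]
    best_n = max(range(len(anchors)), key=lambda n: ious[n])
    ignored = [n for n, iou in enumerate(ious) if iou > ignore_thres]
    return best_n, ious, ignored


# Anchors in grid units (example values) and one box of 4.2 x 4.0 cells
anchors = [(3.625, 2.8125), (4.875, 6.1875), (11.65625, 10.1875)]
best_n, ious, ignored = match_anchor((4.2, 4.0), anchors, ignore_thres=0.5)
```

In this example anchor 0 is the best match and becomes the positive sample. Anchor 1 also exceeds the threshold, so it is counted neither as a positive nor as a negative, while anchor 2 stays in the no-object mask.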
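This completes the target composition. The function returns the IoU score of each sample, the class mask, the mask of cells that contain an object, the mask of cells that do not, the x, y, w, h targets for every grid cell, the one-hot encoded class information, and the box confidence.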
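To see what the tx, ty, tw, th targets encode, the following sketch applies the same formulas to one box and then inverts them. Both helper names are hypothetical, the box and anchor values are examples, and `decode_box` is only the algebraic inverse of the target encoding used here as a check; it is not a description of how the network decodes its raw outputs.

```python
import math


def encode_box(x, y, w, h, grid, anchor):
    """Encode a normalized box as (cell, (tx, ty, tw, th))."""
    gx, gy, gw, gh = x * grid, y * grid, w * grid, h * grid
    gi, gj = int(gx), int(gy)   # cell that owns the box center
    tx, ty = gx - gi, gy - gj   # offset of the center inside the cell
    tw = math.log(gw / anchor[0] + 1e-16)  # log ratio to the anchor width
    th = math.log(gh / anchor[1] + 1e-16)  # log ratio to the anchor height
    return (gi, gj), (tx, ty, tw, th)


def decode_box(cell, t, grid, anchor):
    """Invert encode_box back to a normalized (x, y, w, h) box."""
    gi, gj = cell
    tx, ty, tw, th = t
    return (
        (gi + tx) / grid,
        (gj + ty) / grid,
        anchor[0] * math.exp(tw) / grid,
        anchor[1] * math.exp(th) / grid,
    )


anchor = (3.625, 2.8125)
cell, t = encode_box(0.5, 0.3, 0.25, 0.4, grid=13, anchor=anchor)
restored = decode_box(cell, t, grid=13, anchor=anchor)
```

The box center at (0.5, 0.3) lands in cell (6, 3) of a 13x13 grid, and decoding the targets recovers the original box up to floating-point error, which confirms that the encoding loses no information for the matched anchor.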