Pytorch | YOWO: Principles and Code Explained (Part 2)

This post continues Pytorch|YOWO原理及代碼詳解(一) (Part 1 of this walkthrough); it is best read after that one.

1. Starting the actual training

    if opt.evaluate:
        logging('evaluating ...')
        test(0)
    else:
        for epoch in range(opt.begin_epoch, opt.end_epoch + 1):
            # Train the model for 1 epoch
            train(epoch)

            # Validate the model
            fscore = test(epoch)

            is_best = fscore > best_fscore
            if is_best:
                print("New best fscore is achieved: ", fscore)
                print("Previous fscore was: ", best_fscore)
                best_fscore = fscore

            # Save the model to backup directory
            state = {
                'epoch': epoch,
                'state_dict': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'fscore': fscore
            }
            save_checkpoint(state, is_best, backupdir, opt.dataset, clip_duration)
            logging('Weights are saved to backup directory: %s' % (backupdir))

To train, set opt.evaluate = False. According to the paper (see the YOWO translation post), training for 5 epochs is enough on ucf24.

2. train

Let's look at the whole train function.

    def train(epoch):
        global processed_batches
        t0 = time.time()
        cur_model = model.module
        region_loss.l_x.reset()
        region_loss.l_y.reset()
        region_loss.l_w.reset()
        region_loss.l_h.reset()
        region_loss.l_conf.reset()
        region_loss.l_cls.reset()
        region_loss.l_total.reset()
        train_loader = torch.utils.data.DataLoader(
            dataset.listDataset(basepath, trainlist, dataset_use=dataset_use, shape=(init_width, init_height),
                                shuffle=True,
                                transform=transforms.Compose([
                                    transforms.ToTensor(),
                                ]),
                                train=True,
                                seen=cur_model.seen,
                                batch_size=batch_size,
                                clip_duration=clip_duration,
                                num_workers=num_workers),
            batch_size=batch_size, shuffle=False, **kwargs)

        lr = adjust_learning_rate(optimizer, processed_batches)
        logging('training at epoch %d, lr %f' % (epoch, lr))

        model.train()

        for batch_idx, (data, target) in enumerate(train_loader):
            adjust_learning_rate(optimizer, processed_batches)
            processed_batches = processed_batches + 1

            if use_cuda:
                data = data.cuda()

            optimizer.zero_grad()
            output = model(data)
            region_loss.seen = region_loss.seen + data.data.size(0)
            loss = region_loss(output, target)
            loss.backward()
            optimizer.step()

            # save result every 1000 batches
            if processed_batches % 500 == 0:  # From time to time, reset averagemeters to see improvements
                region_loss.l_x.reset()
                region_loss.l_y.reset()
                region_loss.l_w.reset()
                region_loss.l_h.reset()
                region_loss.l_conf.reset()
                region_loss.l_cls.reset()
                region_loss.l_total.reset()

        t1 = time.time()
        logging('trained with %f samples/s' % (len(train_loader.dataset) / (t1 - t0)))
        print('')

processed_batches is a global variable holding the number of batches processed so far, which makes it easy to resume training from a checkpoint. t0 = time.time() records the current time. The region_loss meters are then reset.
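These l_x, l_y, …, l_total meters are used with reset()/update() and expose .val and .avg, i.e. they follow the usual AverageMeter pattern. The actual class is presumably defined in the project's utility code; a minimal sketch of the assumed behavior:

    class AverageMeter:
        """Tracks the latest value and the running average of a metric (assumed behavior)."""
        def __init__(self):
            self.reset()

        def reset(self):
            self.val = 0.0     # most recent value
            self.sum = 0.0     # weighted sum since the last reset
            self.count = 0     # total weight (e.g. batch size)
            self.avg = 0.0     # running average

        def update(self, val, n=1):
            self.val = val
            self.sum += val * n
            self.count += n
            self.avg = self.sum / self.count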

2.1 Loading the training dataset

The training data is wrapped in the listDataset class, defined in dataset.py. The complete code is as follows:

class listDataset(Dataset):
    # clip duration = 8, i.e, for each time 8 frames are considered together
    def __init__(self, base, root, dataset_use='ucf101-24', shape=None, shuffle=True,
                 transform=None, target_transform=None, 
                 train=False, seen=0, batch_size=64,
                 clip_duration=16, num_workers=4):
        with open(root, 'r') as file:
            self.lines = file.readlines()
        if shuffle:
            random.shuffle(self.lines)
        self.base_path = base
        self.dataset_use = dataset_use
        self.nSamples  = len(self.lines)
        self.transform = transform
        self.target_transform = target_transform
        self.train = train
        self.shape = shape
        self.seen = seen
        self.batch_size = batch_size
        self.clip_duration = clip_duration
        self.num_workers = num_workers
    def __len__(self):
        return self.nSamples
    def __getitem__(self, index):
        assert index <= len(self), 'index range error'
        imgpath = self.lines[index].rstrip()
        self.shape = (224, 224)
        if self.train: # For Training
            jitter = 0.2
            hue = 0.1
            saturation = 1.5 
            exposure = 1.5
            clip, label = load_data_detection(self.base_path, imgpath,  self.train, self.clip_duration, self.shape, self.dataset_use, jitter, hue, saturation, exposure)
        else: # For Testing
            frame_idx, clip, label = load_data_detection(self.base_path, imgpath, False, self.clip_duration, self.shape, self.dataset_use)
            clip = [img.resize(self.shape) for img in clip]
        if self.transform is not None:
            clip = [self.transform(img) for img in clip]
        # (self.duration, -1) + self.shape = (8, -1, 224, 224)
        clip = torch.cat(clip, 0).view((self.clip_duration, -1) + self.shape).permute(1, 0, 2, 3)
        if self.target_transform is not None:
            label = self.target_transform(label)
        self.seen = self.seen + self.num_workers
        if self.train:
            return (clip, label)
        else:
            return (frame_idx, clip, label)

self.lines stores the lines read from trainlist.txt (each line is the relative path of an annotated key-frame image). random.shuffle(self.lines) shuffles them, and the rest of __init__ simply stores the constructor arguments as attributes.

2.2 Adjusting the learning rate

lr = adjust_learning_rate(optimizer, processed_batches)

Complete code:

    def adjust_learning_rate(optimizer, batch):
        lr = learning_rate
        for i in range(len(steps)):
            scale = scales[i] if i < len(scales) else 1
            if batch >= steps[i]:
                lr = lr * scale
                if batch == steps[i]:
                    break
            else:
                break
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr / batch_size
        return lr

The learning rate is adjusted step by step according to steps, as follows:

        ......
        lr = adjust_learning_rate(optimizer, processed_batches)
        logging('training at epoch %d, lr %f' % (epoch, lr))
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            adjust_learning_rate(optimizer, processed_batches)
            processed_batches = processed_batches + 1
        ......

scales and steps define the decay schedule and are read from ucf24.cfg. A small worked example of how the schedule behaves is sketched below.
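The sketch below mirrors adjust_learning_rate with made-up values (the real learning_rate, steps and scales come from ucf24.cfg, and the real function also writes lr / batch_size into every optimizer parameter group):

    learning_rate = 1e-4        # hypothetical base lr
    steps  = [20000, 40000]     # hypothetical decay points, counted in processed batches
    scales = [0.5, 0.5]         # multiplicative decay applied at each step

    def adjust_learning_rate_demo(batch):
        lr = learning_rate
        for i in range(len(steps)):
            scale = scales[i] if i < len(scales) else 1
            if batch >= steps[i]:
                lr = lr * scale
                if batch == steps[i]:
                    break
            else:
                break
        return lr

    for batch in (0, 19999, 20000, 40000, 60000):
        print(batch, adjust_learning_rate_demo(batch))
    # 0 0.0001
    # 19999 0.0001
    # 20000 5e-05
    # 40000 2.5e-05
    # 60000 2.5e-05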

2.3 Fetching the training data

In the loop for batch_idx, (data, target) in enumerate(train_loader):, data and target are produced by listDataset's __getitem__(self, index).
In it:

            jitter = 0.2
            hue = 0.1
            saturation = 1.5 
            exposure = 1.5

are the data-augmentation parameters used in YOLOv2 (see the post YOLOv2 參數詳解). Their meanings are:

  • jitter: random jittering/cropping to generate more data
  • hue: range of the hue shift
  • saturation & exposure: magnitude of the saturation and exposure changes (sampled through a rand_scale helper; see the sketch after this list)
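hue is drawn uniformly from [-hue, hue] inside data_augmentation, while saturation and exposure go through rand_scale. That helper is not quoted here; in darknet-style loaders it is usually implemented roughly as follows (an assumption, worth checking against the project's utils):

    import random

    def rand_scale(s):
        # Return a factor in [1, s] or its reciprocal in [1/s, 1], each with probability 0.5.
        # With s = 1.5 this yields saturation/exposure factors between 1/1.5 and 1.5.
        scale = random.uniform(1, s)
        if random.randint(1, 10000) % 2:
            return scale
        return 1.0 / scale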

The key call in this part is load_data_detection(self.base_path, imgpath, self.train, self.clip_duration, self.shape, self.dataset_use, jitter, hue, saturation, exposure).
Its complete code is:

def load_data_detection(base_path, imgpath, train, train_dur, shape, dataset_use='ucf101-24', jitter=0.2, hue=0.1, saturation=1.5, exposure=1.5):
    # clip loading and  data augmentation
    # if dataset_use == 'ucf101-24':
    #     base_path = "/usr/home/sut/datasets/ucf24"
    # else:
    #     base_path = "/usr/home/sut/Tim-Documents/jhmdb/data/jhmdb"
    im_split = imgpath.split('/')
    num_parts = len(im_split)
    im_ind = int(im_split[num_parts-1][0:5])
    labpath = os.path.join(base_path, 'labels', im_split[0], im_split[1] ,'{:05d}.txt'.format(im_ind))
    img_folder = os.path.join(base_path, 'rgb-images', im_split[0], im_split[1])
    if dataset_use == 'ucf101-24':
        max_num = len(os.listdir(img_folder))
    else:
        max_num = len(os.listdir(img_folder)) - 1
    clip = []
    ### We change downsampling rate throughout training as a ###
    ### temporal augmentation, which brings around 1-2 frame ###
    ### mAP. During test time it is set to 1.                ###
    d = 1 
    if train:
        d = random.randint(1, 2)
    for i in reversed(range(train_dur)):
        # make it as a loop
        i_temp = im_ind - i * d
        while i_temp < 1:
            i_temp = max_num + i_temp
        while i_temp > max_num:
            i_temp = i_temp - max_num
        if dataset_use == 'ucf101-24':
            path_tmp = os.path.join(base_path, 'rgb-images', im_split[0], im_split[1] ,'{:05d}.jpg'.format(i_temp))
        else:
            path_tmp = os.path.join(base_path, 'rgb-images', im_split[0], im_split[1] ,'{:05d}.png'.format(i_temp))
        clip.append(Image.open(path_tmp).convert('RGB'))
    if train: # Apply augmentation
        clip,flip,dx,dy,sx,sy = data_augmentation(clip, shape, jitter, hue, saturation, exposure)
        label = fill_truth_detection(labpath, clip[0].width, clip[0].height, flip, dx, dy, 1./sx, 1./sy)
        label = torch.from_numpy(label)
    else: # No augmentation
        label = torch.zeros(50*5)
        try:
            tmp = torch.from_numpy(read_truths_args(labpath, 8.0/clip[0].width).astype('float32'))
        except Exception:
            tmp = torch.zeros(1,5)
        tmp = tmp.view(-1)
        tsz = tmp.numel()
        if tsz > 50*5:
            label = tmp[0:50*5]
        elif tsz > 0:
            label[0:tsz] = tmp
    if train:
        return clip, label
    else:
        return im_split[0] + '_' +im_split[1] + '_' + im_split[2], clip, label

The path is split with im_split = imgpath.split('/') to locate the label file labpath and the image folder img_folder.
As the comment in the code says, the downsampling rate is changed throughout training as a temporal augmentation, which brings around 1–2 points of frame-mAP; at test time it is set to 1. This downsampling rate is simply the sampling stride between frames: with d = 2 every other frame is read, and so on.

    d = 1 
    if train:
        d = random.randint(1, 2)

The train_dur argument corresponds to self.clip_duration, the clip duration, which is set to 16 by default. im_ind (56 in the debugging example from the original post) is the index of the annotated key frame within the whole video sequence (the video having been split into individual images).

        i_temp = im_ind - i * d
        while i_temp < 1:
            i_temp = max_num + i_temp
        while i_temp > max_num:
            i_temp = i_temp - max_num

The code above keeps i_temp within a valid range: if the index runs off either end of the video, the sequence is treated as circular and wraps around.
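As a quick check of the sampling logic, take a key frame near the start of a video (im_ind = 4), train_dur = 8, stride d = 2 and a hypothetical max_num = 60:

    im_ind, train_dur, d, max_num = 4, 8, 2, 60   # illustrative values only

    indices = []
    for i in reversed(range(train_dur)):          # i = 7, 6, ..., 0
        i_temp = im_ind - i * d
        while i_temp < 1:                         # wrap around the start of the video
            i_temp = max_num + i_temp
        while i_temp > max_num:                   # wrap around the end of the video
            i_temp = i_temp - max_num
        indices.append(i_temp)

    print(indices)
    # [50, 52, 54, 56, 58, 60, 2, 4]  -> the last sampled frame is the key frame itself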
path_tmp is then the path of the image for each sampled frame.
clip.append(Image.open(path_tmp).convert('RGB')) opens each frame, converts it to RGB and appends it to the clip list.
During training, data augmentation is applied on top of this to enlarge the dataset:

    if train: # Apply augmentation
        clip,flip,dx,dy,sx,sy = data_augmentation(clip, shape, jitter, hue, saturation, exposure)
        label = fill_truth_detection(labpath, clip[0].width, clip[0].height, flip, dx, dy, 1./sx, 1./sy)
        label = torch.from_numpy(label)

The complete data_augmentation code is:

def data_augmentation(clip, shape, jitter, hue, saturation, exposure):
    # Initialize Random Variables
    oh = clip[0].height  
    ow = clip[0].width
    dw =int(ow*jitter)
    dh =int(oh*jitter)
    pleft  = random.randint(-dw, dw)
    pright = random.randint(-dw, dw)
    ptop   = random.randint(-dh, dh)
    pbot   = random.randint(-dh, dh)
    swidth =  ow - pleft - pright
    sheight = oh - ptop - pbot
    sx = float(swidth)  / ow
    sy = float(sheight) / oh 
    dx = (float(pleft)/ow)/sx
    dy = (float(ptop) /oh)/sy
    flip = random.randint(1,10000)%2
    dhue = random.uniform(-hue, hue)
    dsat = rand_scale(saturation)
    dexp = rand_scale(exposure)
    # Augment
    cropped = [img.crop((pleft, ptop, pleft + swidth - 1, ptop + sheight - 1)) for img in clip]
    sized = [img.resize(shape) for img in cropped]
    if flip: 
        sized = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in sized]
    clip = [random_distort_image(img, dhue, dsat, dexp) for img in sized]
    return clip, flip, dx, dy, sx, sy 

Several of the augmentation parameters here come from YOLOv2 and were described above: the frames are jittered (randomly cropped), their hue, saturation and exposure are perturbed, and they are resized to 224×224.
pleft and ptop are the left/top offsets and swidth and sheight are the width and height after jittering; the frames are cropped accordingly: cropped = [img.crop((pleft, ptop, pleft + swidth - 1, ptop + sheight - 1)) for img in clip].
They are then resized: sized = [img.resize(shape) for img in cropped].
If the flag flip is set, a horizontal flip is applied as well: sized = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in sized].
Since the images have been augmented, the labels must be transformed accordingly: label = fill_truth_detection(labpath, clip[0].width, clip[0].height, flip, dx, dy, 1./sx, 1./sy). Because the augmentation mainly shifts, scales and flips the image, the label transform needs the flip flag plus the offsets and scale factors, i.e. flip, dx, dy, 1./sx, 1./sy. The complete code:

def fill_truth_detection(labpath, w, h, flip, dx, dy, sx, sy):
    max_boxes = 50
    label = np.zeros((max_boxes,5))
    if os.path.getsize(labpath):
        bs = np.loadtxt(labpath)
        if bs is None:
            return label
        bs = np.reshape(bs, (-1, 5))
        for i in range(bs.shape[0]):
            cx = (bs[i][1] + bs[i][3]) / (2 * 320)
            cy = (bs[i][2] + bs[i][4]) / (2 * 240)
            imgw = (bs[i][3] - bs[i][1]) / 320
            imgh = (bs[i][4] - bs[i][2]) / 240
            bs[i][0] = bs[i][0] - 1
            bs[i][1] = cx
            bs[i][2] = cy
            bs[i][3] = imgw
            bs[i][4] = imgh
        cc = 0
        for i in range(bs.shape[0]):
            x1 = bs[i][1] - bs[i][3]/2
            y1 = bs[i][2] - bs[i][4]/2
            x2 = bs[i][1] + bs[i][3]/2
            y2 = bs[i][2] + bs[i][4]/2            
            x1 = min(0.999, max(0, x1 * sx - dx)) 
            y1 = min(0.999, max(0, y1 * sy - dy)) 
            x2 = min(0.999, max(0, x2 * sx - dx))
            y2 = min(0.999, max(0, y2 * sy - dy))
            bs[i][1] = (x1 + x2)/2
            bs[i][2] = (y1 + y2)/2
            bs[i][3] = (x2 - x1)
            bs[i][4] = (y2 - y1)
            if flip:
                bs[i][1] =  0.999 - bs[i][1] 
            if bs[i][3] < 0.001 or bs[i][4] < 0.001:
                continue
            label[cc] = bs[i]
            cc += 1
            if cc >= 50:
                break
    label = np.reshape(label, (-1))
    return label

From this code we can infer that the raw annotations in the label files use the format [class, x_min, y_min, x_max, y_max], with corner coordinates in pixels.

            cx = (bs[i][1] + bs[i][3]) / (2 * 320)
            cy = (bs[i][2] + bs[i][4]) / (2 * 240)
            imgw = (bs[i][3] - bs[i][1]) / 320
            imgh = (bs[i][4] - bs[i][2]) / 240

So cx and cy are the relative position (between 0 and 1) of the box centre in the image, and imgw and imgh are its relative width and height (between 0 and 1); 320 and 240 are the original ucf24 frame width and height. bs[i][0] holds the class, and bs[i][0] - 1 is used because class indices start from 0 in the network. Normalizing matters because the label transform works with the image offsets and scale factors, which are themselves between 0 and 1.
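A quick worked example with a hypothetical annotation row:

    # one annotation row: [class, x_min, y_min, x_max, y_max] in 320x240 pixel coordinates
    cls, xmin, ymin, xmax, ymax = 10, 80.0, 60.0, 240.0, 180.0   # hypothetical values

    cx   = (xmin + xmax) / (2 * 320)   # 0.5  -> box centre sits at the middle of the image
    cy   = (ymin + ymax) / (2 * 240)   # 0.5
    imgw = (xmax - xmin) / 320         # 0.5  -> box covers half the image width
    imgh = (ymax - ymin) / 240         # 0.5  -> and half the image height
    cls  = cls - 1                     # classes become 0-based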
x1, y1, x2 and y2 are the normalized corner coordinates [x_min, y_min, x_max, y_max]. Expressions such as min(0.999, max(0, x1 * sx - dx)) clamp the coordinates to the image range so that they cannot overflow. The subsequent bs[i][1]…bs[i][4] are the box parameters after applying the offsets and scale factors, back in centre form [x_c, y_c, W, H]. If the flip flag is set, the centre x coordinate is mirrored. If the width or height is too small (if bs[i][3] < 0.001 or bs[i][4] < 0.001), the box is skipped. Boxes that pass these checks are written into label. After augmentation, clip and label are returned:

    if train:
        return clip, label

clip is the sequence of consecutive frames (resized to 224) and label holds the corresponding normalized targets. label has length 250 because at most 50 targets are assumed per clip, each described by five values [class, x_c, y_c, W, H]; unused slots stay 0.
After this, execution returns to listDataset's __getitem__. if self.transform is not None: decides whether a transform is applied to the images (for transforms usage see the post 圖像預處理——transforms). Then clip = torch.cat(clip, 0).view((self.clip_duration, -1) + self.shape).permute(1, 0, 2, 3) concatenates the 16 sampled frames and rearranges the tensor into [Channel, Depth, Height, Width] order (channels, depth/time, height, width) to match what torch.nn.Conv3d expects. The resulting clip has shape [3, 16, 224, 224]: 3 channels, depth 16, height 224 and width 224, as expected (a small sketch reproducing this reshape follows the code below). The data is then returned by:

        if self.train:
            return (clip, label)
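To see the reshape concretely, here is a minimal sketch that mimics the tensor manipulation in __getitem__ with dummy frames (transforms.ToTensor() turns each 224×224 RGB frame into a [3, 224, 224] tensor):

    import torch

    clip_duration, shape = 16, (224, 224)
    # stand-ins for the 16 frames after transforms.ToTensor()
    clip = [torch.rand(3, *shape) for _ in range(clip_duration)]

    clip = torch.cat(clip, 0)                      # [16*3, 224, 224]
    clip = clip.view((clip_duration, -1) + shape)  # [16, 3, 224, 224]
    clip = clip.permute(1, 0, 2, 3)                # [3, 16, 224, 224] = [C, D, H, W]
    print(clip.shape)                              # torch.Size([3, 16, 224, 224])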

2.4 Loss computation and backpropagation

Once the training data and labels have been fetched, the loss is computed and backpropagation/optimization is performed:

        for batch_idx, (data, target) in enumerate(train_loader):
            adjust_learning_rate(optimizer, processed_batches)
            processed_batches = processed_batches + 1

            if use_cuda:
                data = data.cuda()

            optimizer.zero_grad()
            output = model(data)
            region_loss.seen = region_loss.seen + data.data.size(0)
            loss = region_loss(output, target)
            loss.backward()
            optimizer.step()

            # save result every 1000 batches
            if processed_batches % 500 == 0:  # From time to time, reset averagemeters to see improvements
                region_loss.l_x.reset()
                region_loss.l_y.reset()
                region_loss.l_w.reset()
                region_loss.l_h.reset()
                region_loss.l_conf.reset()
                region_loss.l_cls.reset()
                region_loss.l_total.reset()

        t1 = time.time()
        logging('trained with %f samples/s' % (len(train_loader.dataset) / (t1 - t0)))
        print('')

On every iteration adjust_learning_rate is called to decide whether the learning rate needs to change. Next, let's look at how the tensor shapes evolve in the forward pass: output = model(data). Here data has shape [4, 3, 16, 224, 224], where 4 is the batch size (four clips per batch). Stepping into the call lands in YOWO's forward function:

    def forward(self, input):
        x_3d = input # Input clip
        x_2d = input[:, :, -1, :, :] # Last frame of the clip that is read

        x_2d = self.backbone_2d(x_2d)
        x_3d = self.backbone_3d(x_3d)
        x_3d = torch.squeeze(x_3d, dim=2)

        x = torch.cat((x_3d, x_2d), dim=1)
        x = self.cfam(x)

        out = self.conv_final(x)

        return out

x_3d is the whole clip, while x_2d is only the last frame of the clip (as described when building the data, frames are sampled backwards from the annotated key frame, so x_2d is the last frame of the sequence). The overall flow is as follows (a small shape-check sketch is given after the list):

  • x_2d goes through the 2D backbone: x_2d = self.backbone_2d(x_2d); the output shape is 4×425×7×7. This is YOLOv2's prediction layout: the image is divided into a 7×7 grid and each grid cell uses 5 anchors to predict boxes over 80 classes, hence 425 = 5×(5+80).
  • x_3d goes through the 3D backbone: x_3d = self.backbone_3d(x_3d); the output shape is 4×2048×1×7×7.
  • Following the paper, x_3d and x_2d must be concatenated along the channel dimension, but x_3d still has a (depth) dimension of size 1, so it is removed with x_3d = torch.squeeze(x_3d, dim=2).
  • x = torch.cat((x_3d, x_2d), dim=1) then concatenates along the channels.
  • The result goes through the CFAM module, x = self.cfam(x), producing x with shape 4×1024×7×7.
  • Finally a 1×1 convolution yields the spatio-temporal action localization output: out = self.conv_final(x), where self.conv_final = nn.Conv2d(1024, 5*(opt.n_classes+4+1), kernel_size=1, bias=False) and opt.n_classes = 24, i.e. 24 action classes. The final out has shape 4×145×7×7.
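To sanity-check these shapes without loading the real backbones, the fusion steps can be reproduced on random tensors standing in for the backbone outputs (the 1×1 convolution below is only a placeholder for CFAM, not its real implementation):

    import torch
    import torch.nn as nn

    n_classes, batch = 24, 4
    x_2d = torch.rand(batch, 425, 7, 7)       # stand-in for the backbone_2d output (YOLOv2 head)
    x_3d = torch.rand(batch, 2048, 1, 7, 7)   # stand-in for the backbone_3d output

    x_3d = torch.squeeze(x_3d, dim=2)         # [4, 2048, 7, 7]
    x = torch.cat((x_3d, x_2d), dim=1)        # [4, 2473, 7, 7] -> fed into CFAM in the real model
    cfam_stub = nn.Conv2d(2473, 1024, kernel_size=1)   # placeholder for the CFAM module
    x = cfam_stub(x)                          # [4, 1024, 7, 7]

    conv_final = nn.Conv2d(1024, 5 * (n_classes + 4 + 1), kernel_size=1, bias=False)
    out = conv_final(x)
    print(out.shape)                          # torch.Size([4, 145, 7, 7])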

The analysis above explains where the model's output (the out above) comes from. Next comes the loss computation: loss = region_loss(output, target). Stepping into it brings us to the forward function of the RegionLoss class:

    def forward(self, output, target):
        # output : B*A*(4+1+num_classes)*H*W
        # B: number of batches
        # A: number of anchors
        # 4: 4 parameters for each bounding box
        # 1: confidence score
        # num_classes
        # H: height of the image (in grids)
        # W: width of the image (in grids)
        # for each grid cell, there are A*(4+1+num_classes) parameters
        t0 = time.time()
        nB = output.data.size(0)
        nA = self.num_anchors
        nC = self.num_classes
        nH = output.data.size(2)
        nW = output.data.size(3)

        # resize the output (all parameters for each anchor can be reached)
        output   = output.view(nB, nA, (5+nC), nH, nW)
        # anchor's parameter tx
        x    = torch.sigmoid(output.index_select(2, Variable(torch.cuda.LongTensor([0]))).view(nB, nA, nH, nW))
        # anchor's parameter ty
        y    = torch.sigmoid(output.index_select(2, Variable(torch.cuda.LongTensor([1]))).view(nB, nA, nH, nW))
        # anchor's parameter tw
        w    = output.index_select(2, Variable(torch.cuda.LongTensor([2]))).view(nB, nA, nH, nW)
        # anchor's parameter th
        h    = output.index_select(2, Variable(torch.cuda.LongTensor([3]))).view(nB, nA, nH, nW)
        # confidence score for each anchor
        conf = torch.sigmoid(output.index_select(2, Variable(torch.cuda.LongTensor([4]))).view(nB, nA, nH, nW))
        # anchor's parameter class label
        cls  = output.index_select(2, Variable(torch.linspace(5,5+nC-1,nC).long().cuda()))
        # resize the data structure so that for every anchor there is a class label in the last dimension
        cls  = cls.view(nB*nA, nC, nH*nW).transpose(1,2).contiguous().view(nB*nA*nH*nW, nC)
        t1 = time.time()

        # for the prediction of localization of each bounding box, there exist 4 parameters (tx, ty, tw, th)
        pred_boxes = torch.cuda.FloatTensor(4, nB*nA*nH*nW)
        # tx and ty
        grid_x = torch.linspace(0, nW-1, nW).repeat(nH,1).repeat(nB*nA, 1, 1).view(nB*nA*nH*nW).cuda()
        grid_y = torch.linspace(0, nH-1, nH).repeat(nW,1).t().repeat(nB*nA, 1, 1).view(nB*nA*nH*nW).cuda()
        # for each anchor there are anchor_step variables (with the structure num_anchor*anchor_step)
        # for each row(anchor), the first variable is anchor's width, second is anchor's height
        # pw and ph
        anchor_w = torch.Tensor(self.anchors).view(nA, self.anchor_step).index_select(1, torch.LongTensor([0])).cuda()
        anchor_h = torch.Tensor(self.anchors).view(nA, self.anchor_step).index_select(1, torch.LongTensor([1])).cuda()
        # for each pixel (grid) repeat the above process (obtain width and height of each grid)
        anchor_w = anchor_w.repeat(nB, 1).repeat(1, 1, nH*nW).view(nB*nA*nH*nW)
        anchor_h = anchor_h.repeat(nB, 1).repeat(1, 1, nH*nW).view(nB*nA*nH*nW)
        # prediction of bounding box localization
        # x.data and y.data: top left corner of the anchor
        # grid_x, grid_y: tx and ty predictions made by yowo

        x_data = x.data.view(-1)
        y_data = y.data.view(-1)
        w_data = w.data.view(-1)
        h_data = h.data.view(-1)

        pred_boxes[0] = x_data + grid_x    # bx
        pred_boxes[1] = y_data + grid_y    # by
        pred_boxes[2] = torch.exp(w_data) * anchor_w    # bw
        pred_boxes[3] = torch.exp(h_data) * anchor_h    # bh
        # the size -1 is inferred from other dimensions
        # pred_boxes (nB*nA*nH*nW, 4)
        pred_boxes = convert2cpu(pred_boxes.transpose(0,1).contiguous().view(-1,4))
        t2 = time.time()

        nGT, nCorrect, coord_mask, conf_mask, cls_mask, tx, ty, tw, th, tconf, tcls = build_targets(pred_boxes, target.data, self.anchors, nA, nC, \
                                                               nH, nW, self.noobject_scale, self.object_scale, self.thresh, self.seen)
        cls_mask = (cls_mask == 1)
        #  keep those with high box confidence scores (greater than 0.25) as our final predictions
        nProposals = int((conf > 0.25).sum().data.item())

        tx    = Variable(tx.cuda())
        ty    = Variable(ty.cuda())
        tw    = Variable(tw.cuda())
        th    = Variable(th.cuda())
        tconf = Variable(tconf.cuda())
        tcls = Variable(tcls.view(-1)[cls_mask.view(-1)].long().cuda())

        coord_mask = Variable(coord_mask.cuda())
        conf_mask  = Variable(conf_mask.cuda().sqrt())
        cls_mask   = Variable(cls_mask.view(-1, 1).repeat(1,nC).cuda())
        cls        = cls[cls_mask].view(-1, nC)  

        t3 = time.time()

        # losses between predictions and targets (ground truth)
        # In total 6 aspects are considered as losses: 
        # 4 for bounding box location, 2 for prediction confidence and classification seperately
        loss_x = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(x*coord_mask, tx*coord_mask)/2.0
        loss_y = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(y*coord_mask, ty*coord_mask)/2.0
        loss_w = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(w*coord_mask, tw*coord_mask)/2.0
        loss_h = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(h*coord_mask, th*coord_mask)/2.0
        loss_conf = nn.MSELoss(reduction='sum')(conf*conf_mask, tconf*conf_mask)/2.0

        # try focal loss with gamma = 2
        FL = FocalLoss(class_num=24, gamma=2, size_average=False)
        loss_cls = self.class_scale * FL(cls, tcls)

        # sum of loss
        loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls
        #print(loss)
        t4 = time.time()

        self.l_x.update(loss_x.data.item(), self.batch)
        self.l_y.update(loss_y.data.item(), self.batch)
        self.l_w.update(loss_w.data.item(), self.batch)
        self.l_h.update(loss_h.data.item(), self.batch)
        self.l_conf.update(loss_conf.data.item(), self.batch)
        self.l_cls.update(loss_cls.data.item(), self.batch)
        self.l_total.update(loss.data.item(), self.batch)


        if False:
            print('-----------------------------------')
            print('        activation : %f' % (t1 - t0))
            print(' create pred_boxes : %f' % (t2 - t1))
            print('     build targets : %f' % (t3 - t2))
            print('       create loss : %f' % (t4 - t3))
            print('             total : %f' % (t4 - t0))
        print('%d: nGT %d, recall %d, proposals %d, loss: x %.2f(%.2f), '
              'y %.2f(%.2f), w %.2f(%.2f), h %.2f(%.2f), conf %.2f(%.2f), '
              'cls %.2f(%.2f), total %.2f(%.2f)'
               % (self.seen, nGT, nCorrect, nProposals, self.l_x.val, self.l_x.avg,
                self.l_y.val, self.l_y.avg, self.l_w.val, self.l_w.avg,
                self.l_h.val, self.l_h.avg, self.l_conf.val, self.l_conf.avg,
                self.l_cls.val, self.l_cls.avg, self.l_total.val, self.l_total.avg))
        return loss

nB is the batch size, nA the number of anchors, nC the number of classes, and nH and nW the height and width of the feature map (i.e. of output). output = output.view(nB, nA, (5+nC), nH, nW) reshapes output so that all parameters of each anchor can be addressed.
The following lines, from x = torch.sigmoid(output.index_select(2, Variable(torch.cuda.LongTensor([0]))).view(nB, nA, nH, nW)) to conf = torch.sigmoid(output.index_select(2, Variable(torch.cuda.LongTensor([4]))).view(nB, nA, nH, nW)), extract each anchor's predicted x, y, w, h and confidence score. cls = output.index_select(2, Variable(torch.linspace(5,5+nC-1,nC).long().cuda())) extracts the class predictions of every anchor in every grid cell, with shape 4×5×24×7×7. The tensor is then rearranged so that every anchor has its class scores in the last dimension: cls = cls.view(nB*nA, nC, nH*nW).transpose(1,2).contiguous().view(nB*nA*nH*nW, nC), giving shape 980×24, where 980 = 4×5×7×7, i.e. 980 anchors over the whole batch.
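The index_select calls are equivalent to plain slicing along dimension 2; here is a minimal sketch of the same decoding on a dummy output tensor (sigmoid applied to x, y and conf, raw values kept for w and h):

    import torch

    nB, nA, nC, nH, nW = 4, 5, 24, 7, 7
    output = torch.rand(nB, nA * (5 + nC), nH, nW)   # dummy network output, [4, 145, 7, 7]
    output = output.view(nB, nA, 5 + nC, nH, nW)

    x    = torch.sigmoid(output[:, :, 0, :, :])      # [4, 5, 7, 7]
    y    = torch.sigmoid(output[:, :, 1, :, :])
    w    = output[:, :, 2, :, :]
    h    = output[:, :, 3, :, :]
    conf = torch.sigmoid(output[:, :, 4, :, :])
    cls  = output[:, :, 5:, :, :]                    # [4, 5, 24, 7, 7]
    cls  = cls.reshape(nB * nA, nC, nH * nW).transpose(1, 2).reshape(nB * nA * nH * nW, nC)
    print(cls.shape)                                 # torch.Size([980, 24])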
For the localization of each bounding box there are 4 parameters (tx, ty, tw, th). Next, the grid-cell coordinate indices are generated. grid_x and grid_y each hold 980 entries, giving the x and y grid index of the cell each anchor sits in: the (grid_x, grid_y) pairs run (0,0), (1,0), …, (6,0), (0,1), … over the 7×7 grid, repeated for every anchor and every image in the batch.
Next, the anchor widths and heights configured in the cfg file are read into anchor_w and anchor_h. The code below then broadcasts each anchor's width and height to all 980 anchors:

        anchor_w = anchor_w.repeat(nB, 1).repeat(1, 1, nH*nW).view(nB*nA*nH*nW)
        anchor_h = anchor_h.repeat(nB, 1).repeat(1, 1, nH*nW).view(nB*nA*nH*nW)

After the broadcast, both tensors have shape [980].
To line up with the 980 anchors, x, y, w and h are flattened into x_data, y_data, w_data and h_data. Adding the grid offsets and applying the anchor scales gives the predicted box locations:

        pred_boxes[0] = x_data + grid_x    # bx
        pred_boxes[1] = y_data + grid_y    # by
        pred_boxes[2] = torch.exp(w_data) * anchor_w    # bw
        pred_boxes[3] = torch.exp(h_data) * anchor_h    # bh

If this step is unclear, refer to the bounding-box prediction mechanism in the YOLOv2 paper, summarized by the equations below (the original post showed the corresponding figure from the paper here).
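These are the standard YOLOv2 decoding equations, where $(c_x, c_y)$ is the top-left corner of the grid cell containing the prediction and $(p_w, p_h)$ are the anchor priors; in the code, the sigmoids $\sigma(t_x)$ and $\sigma(t_y)$ are already stored in x_data and y_data:

$$
b_x = \sigma(t_x) + c_x,\qquad b_y = \sigma(t_y) + c_y,\qquad b_w = p_w\, e^{t_w},\qquad b_h = p_h\, e^{t_h}
$$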
pred_boxes is then moved to the CPU and reshaped to (nB×nA×nH×nW, 4).
Then the targets are built: nGT, nCorrect, coord_mask, conf_mask, cls_mask, tx, ty, tw, th, tconf, tcls = build_targets(pred_boxes, target.data, self.anchors, nA, nC, nH, nW, self.noobject_scale, self.object_scale, self.thresh, self.seen). The complete build_targets function is:

def build_targets(pred_boxes, target, anchors, num_anchors, num_classes, nH, nW, noobject_scale, object_scale, sil_thresh, seen):
    # nH, nW here are number of grids in y and x directions (7, 7 here)
    nB = target.size(0) # batch size
    nA = num_anchors    # 5 for our case
    nC = num_classes
    anchor_step = len(anchors)//num_anchors
    conf_mask  = torch.ones(nB, nA, nH, nW) * noobject_scale
    coord_mask = torch.zeros(nB, nA, nH, nW)
    cls_mask   = torch.zeros(nB, nA, nH, nW)
    tx         = torch.zeros(nB, nA, nH, nW) 
    ty         = torch.zeros(nB, nA, nH, nW) 
    tw         = torch.zeros(nB, nA, nH, nW) 
    th         = torch.zeros(nB, nA, nH, nW) 
    tconf      = torch.zeros(nB, nA, nH, nW)
    tcls       = torch.zeros(nB, nA, nH, nW) 

    # for each grid there are nA anchors
    # nAnchors is the number of anchor for one image
    nAnchors = nA*nH*nW
    nPixels  = nH*nW
    # for each image
    for b in xrange(nB):
        # get all anchor boxes in one image
        # (4 * nAnchors)
        cur_pred_boxes = pred_boxes[b*nAnchors:(b+1)*nAnchors].t()
        # initialize iou score for each anchor
        cur_ious = torch.zeros(nAnchors)
        for t in xrange(50):
            # for each anchor 4 coordinate parameters, already in the coordinate system for the whole image
            # this loop is for anchors in each image
            # for each anchor 5 parameters are available (class, x, y, w, h)
            if target[b][t*5+1] == 0:
                break
            gx = target[b][t*5+1]*nW
            gy = target[b][t*5+2]*nH
            gw = target[b][t*5+3]*nW
            gh = target[b][t*5+4]*nH
            # groud truth boxes
            cur_gt_boxes = torch.FloatTensor([gx,gy,gw,gh]).repeat(nAnchors,1).t()
            # bbox_ious is the iou value between orediction and groud truth
            cur_ious = torch.max(cur_ious, bbox_ious(cur_pred_boxes, cur_gt_boxes, x1y1x2y2=False))
        # if iou > a given threshold, it is seen as it includes an object
        # conf_mask[b][cur_ious>sil_thresh] = 0
        conf_mask_t = conf_mask.view(nB, -1)
        conf_mask_t[b][cur_ious>sil_thresh] = 0
        conf_mask_tt = conf_mask_t[b].view(nA, nH, nW)
        conf_mask[b] = conf_mask_tt

    if seen < 12800:
       if anchor_step == 4:
           tx = torch.FloatTensor(anchors).view(nA, anchor_step).index_select(1, torch.LongTensor([2])).view(1,nA,1,1).repeat(nB,1,nH,nW)
           ty = torch.FloatTensor(anchors).view(num_anchors, anchor_step).index_select(1, torch.LongTensor([2])).view(1,nA,1,1).repeat(nB,1,nH,nW)
       else:
           tx.fill_(0.5)
           ty.fill_(0.5)
       tw.zero_()
       th.zero_()
       coord_mask.fill_(1)

    # number of ground truth
    nGT = 0
    nCorrect = 0
    for b in xrange(nB):
        # anchors for one batch (at least batch size, and for some specific classes, there might exist more than one anchor)
        for t in xrange(50):
            if target[b][t*5+1] == 0:
                break
            nGT = nGT + 1
            best_iou = 0.0
            best_n = -1
            min_dist = 10000
            # the values saved in target is ratios
            # times by the width and height of the output feature maps nW and nH
            gx = target[b][t*5+1] * nW
            gy = target[b][t*5+2] * nH
            gi = int(gx)
            gj = int(gy)
            gw = target[b][t*5+3] * nW
            gh = target[b][t*5+4] * nH
            gt_box = [0, 0, gw, gh]
            for n in xrange(nA):
                # get anchor parameters (2 values)
                aw = anchors[anchor_step*n]
                ah = anchors[anchor_step*n+1]
                anchor_box = [0, 0, aw, ah]
                # only consider the size (width and height) of the anchor box
                iou  = bbox_iou(anchor_box, gt_box, x1y1x2y2=False)
                if anchor_step == 4:
                    ax = anchors[anchor_step*n+2]
                    ay = anchors[anchor_step*n+3]
                    dist = pow(((gi+ax) - gx), 2) + pow(((gj+ay) - gy), 2)
                # get the best anchor form with the highest iou
                if iou > best_iou:
                    best_iou = iou
                    best_n = n
                elif anchor_step==4 and iou == best_iou and dist < min_dist:
                    best_iou = iou
                    best_n = n
                    min_dist = dist

            # then we determine the parameters for an anchor (4 values together)
            gt_box = [gx, gy, gw, gh]
            # find corresponding prediction box
            pred_box = pred_boxes[b*nAnchors+best_n*nPixels+gj*nW+gi]

            # only consider the best anchor box, for each image
            coord_mask[b][best_n][gj][gi] = 1
            cls_mask[b][best_n][gj][gi] = 1
            # in this cell of the output feature map, there exists an object
            conf_mask[b][best_n][gj][gi] = object_scale
            tx[b][best_n][gj][gi] = target[b][t*5+1] * nW - gi
            ty[b][best_n][gj][gi] = target[b][t*5+2] * nH - gj
            tw[b][best_n][gj][gi] = math.log(gw/anchors[anchor_step*best_n])
            th[b][best_n][gj][gi] = math.log(gh/anchors[anchor_step*best_n+1])
            iou = bbox_iou(gt_box, pred_box, x1y1x2y2=False) # best_iou
            # confidence equals to iou of the corresponding anchor
            tconf[b][best_n][gj][gi] = iou
            tcls[b][best_n][gj][gi] = target[b][t*5]
            # if ious larger than 0.5, we justify it as a correct prediction
            if iou > 0.5:
                nCorrect = nCorrect + 1

    # true values are returned
    return nGT, nCorrect, coord_mask, conf_mask, cls_mask, tx, ty, tw, th, tconf, tcls

This function builds the ground-truth targets: it first creates conf_mask, …, tcls tensors of shape nB×nA×nH×nW. This build_targets is essentially the same as the one I analyzed in Pytorch | yolov3原理及代碼詳解(二). Every grid cell has nA (5) anchors, and nAnchors = nA*nH*nW is the total number of anchors for one image.
We then iterate over the images: for b in xrange(nB):. For each image, all of its predicted boxes are taken: cur_pred_boxes = pred_boxes[b*nAnchors:(b+1)*nAnchors].t(), with shape 4×245 (245 anchors per image, from 5×7×7), each box already expressed in the feature-map coordinate system with 4 coordinate parameters. The inner loop for t in xrange(50): runs over the ground-truth targets of that image; the 50 matches the maximum of 50 targets used when building label, and each target provides 5 values (class, x, y, w, h).
The normalized target values are multiplied by the feature-map height (nH) and width (nW) to obtain absolute coordinates on the 7×7 feature map: gx, gy, gw, gh. From these four values the ground-truth boxes cur_gt_boxes are constructed.
Next the IoU values are computed: cur_ious = torch.max(cur_ious, bbox_ious(cur_pred_boxes, cur_gt_boxes, x1y1x2y2=False)). The complete bbox_ious code:

def bbox_ious(boxes1, boxes2, x1y1x2y2=True):
    if x1y1x2y2:
        mx = torch.min(boxes1[0], boxes2[0])
        Mx = torch.max(boxes1[2], boxes2[2])
        my = torch.min(boxes1[1], boxes2[1])
        My = torch.max(boxes1[3], boxes2[3])
        w1 = boxes1[2] - boxes1[0]
        h1 = boxes1[3] - boxes1[1]
        w2 = boxes2[2] - boxes2[0]
        h2 = boxes2[3] - boxes2[1]
    else:
        mx = torch.min(boxes1[0]-boxes1[2]/2.0, boxes2[0]-boxes2[2]/2.0)
        Mx = torch.max(boxes1[0]+boxes1[2]/2.0, boxes2[0]+boxes2[2]/2.0)
        my = torch.min(boxes1[1]-boxes1[3]/2.0, boxes2[1]-boxes2[3]/2.0)
        My = torch.max(boxes1[1]+boxes1[3]/2.0, boxes2[1]+boxes2[3]/2.0)
        w1 = boxes1[2]
        h1 = boxes1[3]
        w2 = boxes2[2]
        h2 = boxes2[3]
    uw = Mx - mx
    uh = My - my
    cw = w1 + w2 - uw
    ch = h1 + h2 - uh
    mask = ((cw <= 0) + (ch <= 0) > 0)
    area1 = w1 * h1
    area2 = w2 * h2
    carea = cw * ch
    carea[mask] = 0
    uarea = area1 + area2 - carea
    return carea/uarea
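A quick usage example (with bbox_ious as defined above; boxes are passed as 4×N tensors in (cx, cy, w, h) format, which is why x1y1x2y2=False is used):

    import torch

    # box A: centre (3.5, 3.5), size 2x2   |   box B: centre (4.0, 3.5), size 2x2
    boxes1 = torch.tensor([[3.5], [3.5], [2.0], [2.0]])
    boxes2 = torch.tensor([[4.0], [3.5], [2.0], [2.0]])

    print(bbox_ious(boxes1, boxes2, x1y1x2y2=False))
    # tensor([0.6000])  -> intersection 1.5*2 = 3, union 4 + 4 - 3 = 5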

Here x1y1x2y2 is False, i.e. boxes are in centre format. This computes, for every anchor, the best IoU of its prediction against all targets; anchors whose best IoU exceeds the given threshold are considered to contain an object, and their conf_mask entries are set to 0 so that they are excluded from the no-object confidence loss (see the loss analysis below). The next block was not obvious to me at first; it is most likely a warm-up/optimization strategy (discussed again under the loss analysis):

    if seen < 12800:
       if anchor_step == 4:
           tx = torch.FloatTensor(anchors).view(nA, anchor_step).index_select(1, torch.LongTensor([2])).view(1,nA,1,1).repeat(nB,1,nH,nW)
           ty = torch.FloatTensor(anchors).view(num_anchors, anchor_step).index_select(1, torch.LongTensor([2])).view(1,nA,1,1).repeat(nB,1,nH,nW)
       else:
           tx.fill_(0.5)
           ty.fill_(0.5)
       tw.zero_()
       th.zero_()
       coord_mask.fill_(1)

Next, the ground truth is assigned image by image. Each image may contain several targets (possibly more than one per class), hence the loop for t in xrange(50):.
gx and gy are the target's absolute coordinates on the feature map, gi and gj are the indices (top-left corner) of the grid cell containing the target (this is YOLO's assignment mechanism), and gw and gh are the target's absolute width and height on the feature map. The anchors are then compared against the target by IoU, using only their widths and heights (both boxes centred at the origin), and the anchor with the highest IoU is selected. In other words, this IoU computation decides which anchor box is responsible for predicting the target: "which anchor box predicts it is determined during training, namely the anchor box with the largest IoU against the ground truth". I analyzed this in detail in Pytorch | yolov3原理及代碼詳解(二), so I will not repeat it here. I am not certain what the anchor_step == 4 case corresponds to; most likely each anchor is then described by 4 values instead of 2.
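As a small worked example of the assignment, take the normalized target from earlier (centre (0.5, 0.5), size 0.5×0.5) on the 7×7 feature map:

    nW, nH = 7, 7                          # feature-map size
    x_c, y_c, w, h = 0.5, 0.5, 0.5, 0.5    # one hypothetical normalized target

    gx, gy = x_c * nW, y_c * nH            # (3.5, 3.5): absolute centre on the 7x7 map
    gw, gh = w * nW, h * nH                # (3.5, 3.5): width/height measured in grid cells
    gi, gj = int(gx), int(gy)              # (3, 3): grid cell responsible for this target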
Next, the target's absolute representation on the feature map is gt_box = [gx, gy, gw, gh], and the prediction made by the best-matching anchor is looked up with pred_box = pred_boxes[b*nAnchors+best_n*nPixels+gj*nW+gi]. Only this best anchor is considered for each target; if its IoU with the ground truth exceeds 0.5, it is counted as a correct prediction:

            # only consider the best anchor box, for each image
            coord_mask[b][best_n][gj][gi] = 1
            cls_mask[b][best_n][gj][gi] = 1
            # in this cell of the output feature map, there exists an object
            conf_mask[b][best_n][gj][gi] = object_scale
            tx[b][best_n][gj][gi] = target[b][t*5+1] * nW - gi
            ty[b][best_n][gj][gi] = target[b][t*5+2] * nH - gj
            tw[b][best_n][gj][gi] = math.log(gw/anchors[anchor_step*best_n])
            th[b][best_n][gj][gi] = math.log(gh/anchors[anchor_step*best_n+1])
            iou = bbox_iou(gt_box, pred_box, x1y1x2y2=False) # best_iou
            # confidence equals to iou of the corresponding anchor
            tconf[b][best_n][gj][gi] = iou
            tcls[b][best_n][gj][gi] = target[b][t*5]
            # if ious larger than 0.5, we justify it as a correct prediction
            if iou > 0.5:
                nCorrect = nCorrect + 1

Once build_targets returns, we are back in RegionLoss.forward. Boxes with high confidence (greater than 0.25) are counted as proposals (nProposals). The target tensors are then moved onto the GPU, and the losses are computed:

        # losses between predictions and targets (ground truth)
        # In total 6 aspects are considered as losses: 
        # 4 for bounding box location, 2 for prediction confidence and classification seperately
        loss_x = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(x*coord_mask, tx*coord_mask)/2.0
        loss_y = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(y*coord_mask, ty*coord_mask)/2.0
        loss_w = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(w*coord_mask, tw*coord_mask)/2.0
        loss_h = self.coord_scale * nn.SmoothL1Loss(reduction='sum')(h*coord_mask, th*coord_mask)/2.0
        loss_conf = nn.MSELoss(reduction='sum')(conf*conf_mask, tconf*conf_mask)/2.0

        # try focal loss with gamma = 2
        FL = FocalLoss(class_num=24, gamma=2, size_average=False)
        loss_cls = self.class_scale * FL(cls, tcls)

        # sum of loss
        loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

The loss has three parts: 1. the bounding-box loss on (x, y, w, h); 2. the confidence loss; 3. the classification loss. This is not the original YOLOv2 loss. The box terms use Smooth L1: in detection, if one of the four predicted values is far off, an L2 loss squares and thus amplifies the error, whereas L1/Smooth L1 is more robust to such outliers (the definition used by nn.SmoothL1Loss is recalled below). coord_mask is meant to restrict the coordinate loss to the (target, best anchor) pairs, but while self.seen < 12800 it is set to all ones by coord_mask.fill_(1). My guess is that early in training this pushes every anchor, not only the best-IoU one, to regress towards the target/prior, so that better anchors can still emerge instead of being ruled out from the start (I had not fully figured out tw.zero_() and th.zero_(); presumably tw = th = 0 means "predict exactly the anchor's own width and height", since b_w = p_w e^{t_w} reduces to p_w when t_w = 0).
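For reference, with its default settings nn.SmoothL1Loss computes, element-wise (the terms are then summed here because reduction='sum' is used):

$$
\ell(a, b) = \begin{cases} 0.5\,(a - b)^2, & |a - b| < 1 \\ |a - b| - 0.5, & \text{otherwise} \end{cases}
$$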
For a detailed analysis of the YOLOv2 loss, see the post YOLOv2損失函數詳解. The loss has the following form:
$$
\begin{aligned}
\text{loss}_t = \sum_{i=0}^{W}\sum_{j=0}^{H}\sum_{k=0}^{A}\Big[\; & \mathbb{1}_{\text{MaxIOU}<\text{Thresh}}\,\lambda_{\text{noobj}}\left(-b_{ijk}^{o}\right)^{2} \\
& + \mathbb{1}_{t<12800}\,\lambda_{\text{prior}}\sum_{r\in(x,y,w,h)}\left(\text{prior}_{k}^{r}-b_{ijk}^{r}\right)^{2} \\
& + \mathbb{1}_{k}^{\text{truth}}\Big(\lambda_{\text{coord}}\sum_{r\in(x,y,w,h)}\left(\text{truth}^{r}-b_{ijk}^{r}\right)^{2} \\
& \qquad + \lambda_{\text{obj}}\left(\text{IOU}_{\text{truth}}^{k}-b_{ijk}^{o}\right)^{2} \\
& \qquad + \lambda_{\text{class}}\sum_{c=1}^{C}\left(\text{truth}^{c}-b_{ijk}^{c}\right)^{2}\Big)\Big]
\end{aligned}
$$
It can be split into three parts:

  • 1. $\mathbb{1}_{\text{MaxIOU}<\text{Thresh}}\,\lambda_{\text{noobj}}\left(-b_{ijk}^{o}\right)^{2}$
    This is the confidence loss for background. For every predicted box we compute its IoU with all ground-truth boxes and keep the maximum, MaxIOU. If MaxIOU is below a threshold (0.6 in the YOLOv2 paper, sil_thresh in this code), the box is treated as background (intuitively: if an anchor's IoU with every target is low, that anchor should be predicting background); accordingly, boxes whose IoU exceeds the threshold are removed from this term by conf_mask_t[b][cur_ious>sil_thresh] = 0. Note also that conf_mask = torch.ones(nB, nA, nH, nW) * noobject_scale with noobject_scale equal to 1, i.e. $\lambda_{noobj}=1$. In effect $\lambda_{noobj}$ is 0 where there is an object and 1 where there is not, so the first term can be rewritten as
    $\lambda_{noobj}\sum_{i=0}^{l.h \cdot l.w}\sum_{j=0}^{l.n}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2+\lambda_{obj}\sum_{i=0}^{l.h \cdot l.w}\sum_{j=0}^{l.n}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2$
  • 2. $\mathbb{1}_{t<12800}\,\lambda_{\text{prior}}\sum_{r\in(x,y,w,h)}\left(\text{prior}_{k}^{r}-b_{ijk}^{r}\right)^{2}$
    This part is the coordinate error between the anchor boxes (priors) and the predictions, computed only for the first 12800 iterations. As guessed above, it encourages the network to learn the anchor shapes early in training.
  • 3. $\mathbb{1}_{k}^{\text{truth}}\Big(\lambda_{\text{coord}}\sum_{r\in(x,y,w,h)}\left(\text{truth}^{r}-b_{ijk}^{r}\right)^{2}+\lambda_{\text{obj}}\left(\text{IOU}_{\text{truth}}^{k}-b_{ijk}^{o}\right)^{2}+\lambda_{\text{class}}\sum_{c=1}^{C}\left(\text{truth}^{c}-b_{ijk}^{c}\right)^{2}\Big)$
    This part sums the losses of the predictions matched to ground truth: coordinate loss, confidence loss and classification loss. As mentioned, the coordinate loss here uses Smooth L1. The confidence loss carries an extra weight $\lambda_{obj}$ (self.object_scale in the code, equal to 5), applied to the best anchor via conf_mask[b][best_n][gj][gi] = object_scale. Recall that conf_mask is initialized to noobject_scale (=1) everywhere: anchors left at 1 are background anchors whose target confidence is 0; the best anchor gets weight 5 and its target confidence is its true IoU with the ground truth, the larger weight presumably strengthening the gradient so that the best anchor regresses to the target quickly; anchors whose MaxIOU exceeds the threshold but that are not the best match keep weight 0 and are ignored. Finally, unlike the formula above, the classification loss here uses Focal Loss to handle class imbalance.

After the loss is computed, backpropagation and the optimizer step follow. This completes the analysis of the training flow; what remains is the test part.


The analysis of the test part is covered in:
Pytorch|YOWO原理及代碼詳解(三)
