title: Detr
author: yangsenius
original link: https://senyang-ml.github.io/2020/06/04/detr/
date: 2020-06-04 18:19:09
DETR (DEtection TRansformer) is a recent work that has attracted a great deal of attention. The paper is titled "End-to-End Object Detection with Transformers", and Facebook Research has submitted it to ECCV 2020.
Since there are already plenty of interpretations and reviews of DETR online, this post will not add more general commentary; instead, it focuses on analyzing two concepts:
- Set prediction and Hungarian Loss
- Permutation Invariance
Object detection set prediction loss
Unlike previous object detection models, DETR infers a fixed-size set of $N$ predictions (which can be read as a sequence of length $N$): it outputs $N$ predicted bounding boxes with class confidences, where $N$ is set significantly larger than the number of real objects in the image.
Note: in the paper, for notational convenience, the authors assume the ground-truth set also has size $N$, padding it with $\varnothing$ (no object).
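As a minimal sketch of this padding idea (not DETR's actual code — DETR handles the padding implicitly inside its loss; the function name and the class index below are made up for illustration):

```python
import torch

def pad_targets(gt_labels: torch.Tensor, num_queries: int, no_object_class: int) -> torch.Tensor:
    """Pad a [num_gt] tensor of class labels up to num_queries entries,
    filling the remainder with the special no-object class."""
    padded = torch.full((num_queries,), no_object_class, dtype=torch.int64)
    padded[: gt_labels.numel()] = gt_labels
    return padded

# e.g. 3 real objects, N = 100 queries, class index 91 reserved for "no object"
print(pad_targets(torch.tensor([17, 2, 5]), num_queries=100, no_object_class=91)[:5])
# tensor([17,  2,  5, 91, 91])
```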
Hungarian Loss
The Hungarian loss was first proposed in this paper [1]; it was the first time bipartite matching and the Hungarian algorithm were used in a deep-learning detection task. Below, we explain how the Hungarian loss is used and what it means.
For background on the Hungarian algorithm, see bipartite matching and hungarian algorithm.
The Hungarian algorithm solves the (weighted) maximum-matching problem on a bipartite graph and is guaranteed to find an optimal solution. First, do not let the name mislead you into thinking that the Hungarian loss optimizes the matching itself, i.e. that the matching is solved by backpropagating a loss function. That is not what happens. In fact: we first run the Hungarian algorithm to find the optimal matching, so that each ground-truth box is paired with a predicted box; then, for each matched GT/prediction pair, we compute the loss function.
The loss is the cross-entropy on the class confidences plus the bounding-box regression loss for each matched pair.
So the Hungarian loss simply means: first match with the Hungarian algorithm, then compute an ordinary loss on the matched pairs.
It is therefore a two-step process:
- Step 1: solve the optimal matching with the Hungarian method
- Step 2: with the matching fixed, compute the loss function and backpropagate (a minimal sketch follows this list)
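Here is a minimal, self-contained sketch of the two steps for a single image, assuming class logits of shape [N, num_classes + 1] and boxes of shape [N, 4]. The matching cost is simplified to an L1 box cost plus a class-probability cost; DETR's actual cost adds a GIoU term, and its loss also supervises unmatched predictions toward the no-object class:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def hungarian_loss(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # Step 1: solve the optimal matching (no gradients flow through this step).
    with torch.no_grad():
        prob = pred_logits.softmax(-1)                      # [N, num_classes + 1]
        cost_class = -prob[:, gt_labels]                    # [N, num_gt]
        cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # [N, num_gt]
        C = cost_bbox + cost_class
        pred_idx, gt_idx = linear_sum_assignment(C.cpu())

    # Step 2: with the matching fixed, compute an ordinary supervised loss.
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.int64)
    gt_idx = torch.as_tensor(gt_idx, dtype=torch.int64)
    loss_class = F.cross_entropy(pred_logits[pred_idx], gt_labels[gt_idx])
    loss_bbox = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return loss_class + loss_bbox
```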
Sen Yang Note: observe that the paper performs the bipartite matching with the ground truth as the reference, seeking for each $y_i$ its best match $\hat{y}_{\sigma(i)}$: $\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i}^{N} \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$. The prediction is thus indexed directly by the ground-truth index $i$ through the permutation $\sigma$, i.e. $\hat{y}_{\sigma(i)}$ denotes the prediction matched to ground truth $y_i$. So no matter how the $N$ elements of the prediction sequence are permuted, the Hungarian method always produces the same optimal matching; the benefit of the $\sigma(i)$ indexing is exactly that it expresses **permutation invariance**. This is mentioned in the paper.
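A quick sanity check of this invariance (a toy sketch: shuffling the prediction rows of the cost matrix permutes index_i accordingly, but the set of matched pairs stays the same):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
C = rng.random((5, 3))      # 5 predictions x 3 ground truths

perm = rng.permutation(5)   # shuffle the prediction order
i0, j0 = linear_sum_assignment(C)
i1, j1 = linear_sum_assignment(C[perm])

pairs0 = {(i, j) for i, j in zip(i0, j0)}
pairs1 = {(perm[i], j) for i, j in zip(i1, j1)}  # map shuffled rows back
# with random float costs the optimum is unique almost surely,
# so both orderings recover the same matching
assert pairs0 == pairs1
```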
Everything discussed above can be found in the DETR code.
SetCriterion
The docstring of `SetCriterion`, the criterion that computes DETR's loss, reads:
```python
class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """
```
HungarianMatcher
The docstring and code of `HungarianMatcher` are:
```python
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Modules to compute the matching cost and solve the corresponding LSAP.
"""
import torch
from scipy.optimize import linear_sum_assignment
from torch import nn

from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou


class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network

    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher

        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"

    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        tgt_ids = torch.cat([v["labels"] for v in targets])
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be omitted.
        cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

        # Compute the giou cost between boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()

        sizes = [len(v["boxes"]) for v in targets]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]


def build_matcher(args):
    return HungarianMatcher(cost_class=args.set_cost_class, cost_bbox=args.set_cost_bbox, cost_giou=args.set_cost_giou)
```
The goal of this function is exactly the first step described above: solving the optimal matching with the Hungarian method. Its return value is the final matching, given as pairs of index_i and index_j:
```
A list of size batch_size, containing tuples of (index_i, index_j) where:
    - index_i is the indices of the selected predictions (in order)
    - index_j is the indices of the corresponding selected targets (in order)
For each batch element, it holds:
    len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
```
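For instance, here is a hypothetical way to use the returned indices to gather each matched (prediction, target) pair — the names `matcher`, `outputs`, and `targets` follow the docstring above; this is an illustrative sketch, not DETR's exact loss code:

```python
indices = matcher(outputs, targets)  # one (index_i, index_j) tuple per image

for b, (idx_pred, idx_tgt) in enumerate(indices):
    matched_boxes = outputs["pred_boxes"][b][idx_pred]  # [num_gt, 4]
    matched_labels = targets[b]["labels"][idx_tgt]      # [num_gt]
    # matched_boxes[k] is the prediction assigned to ground truth matched_labels[k]
```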
Because this matching process does not participate in backpropagation (it is a forward-only computation), note that the forward pass
```python
@torch.no_grad()
def forward(self, outputs, targets):
```
uses the `@torch.no_grad()` decorator to disable gradient tracking.
It is also worth noting that `scipy.optimize.linear_sum_assignment(cost_matrix)` is an off-the-shelf API that performs the assignment via the Hungarian algorithm; DETR simply calls this library function to obtain the matching.
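A tiny example of this API on a hand-written 3x3 cost matrix (the optimal assignment picks one entry per row and column so as to minimize the total cost):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4.0, 1.0, 3.0],
                 [2.0, 0.0, 5.0],
                 [3.0, 2.0, 2.0]])
row_ind, col_ind = linear_sum_assignment(cost)
print(row_ind, col_ind)              # [0 1 2] [1 0 2]
print(cost[row_ind, col_ind].sum())  # minimal total cost: 5.0
```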
Permutation Invariant
Let's first go over the concept of permutation invariance. What does it mean?
A simple example: maxpooling is permutation-invariant over a set.
No matter in which order the elements 1, 2, 3, 4, 6 are arranged into a sequence, the result of taking the max is always 6.
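As a one-line check in code:

```python
import itertools

values = [1, 2, 3, 4, 6]
# max() gives the same answer for every ordering of the input sequence
assert all(max(p) == 6 for p in itertools.permutations(values))
```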
Positional Encoding
DETR mentions that an embedding or positional encoding can be used to remove permutation invariance. Why is that? Because a positional embedding can break permutation invariance, turning the operation into a permutation-variant (order-sensitive) one.
For example, equip the max operation with an encoding that multiplies each element by its position index: take the positional encoding to be $p_i = i$ and compute $\max_i \left( i \cdot x_i \right)$ instead of $\max_i x_i$. The output now depends on where each element sits in the sequence.
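A toy demonstration of the position-weighted max sketched above:

```python
def pos_weighted_max(xs):
    # multiply each element by its 1-based position index before taking the max
    return max(i * x for i, x in enumerate(xs, start=1))

print(pos_weighted_max([1, 2, 3, 4, 6]))  # 30 -> the 6 sits at position 5
print(pos_weighted_max([6, 4, 3, 2, 1]))  # 9  -> permuting the inputs changes the answer
```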
DETR reshapes the image features from shape **[batch_size, feature_number, height, width]** to **[batch_size, height*width, feature_number]**, so the 2D spatial structure disappears. This loss of information is exactly why a positional-encoding strategy is needed to express the original, complete 2D structure — and the Transformer's position encoding satisfies this requirement.
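In code, this flattening is just the following (shapes follow the quote above; DETR's actual code permutes to [H*W, batch, C] instead):

```python
import torch

x = torch.randn(2, 256, 32, 32)      # [batch_size, C, H, W] feature map
seq = x.flatten(2).permute(0, 2, 1)  # [batch_size, H*W, C]: 2D grid -> sequence
print(seq.shape)                     # torch.Size([2, 1024, 256])
```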
Sets and sequences
Position encoding turns a set into a sequence: permutation invariance disappears, and permutation variance (order sensitivity) takes its place.
Afterword
Caring only about performance gains is a disaster for the research community.
I keep wondering: if we combined every performance-improving method and trick proposed at top venues from 2012 to 2020, would the mAP on COCO detection really climb to 100%? Put another way, are all these tricks mutually independent, incremental steps forward?
First principles push us to keep tracing a problem back to its essence, while the strong gravitational field of excellent prior methods traps us inside the frameworks of earlier thinking. In recent years one can clearly feel certain frameworks constraining object detection methods. DETR is an attempt to think back to first principles; although it carries many familiar echoes, it is full of vitality.
Original paper: detr
This post introduced the Hungarian loss and permutation invariance; the Transformer part is left for the next installment.
[1] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In CVPR, 2016. ↩︎