Detr & End-to-end object detection with Transformers (1)


title: Detr
author: yangsenius
original link: https://senyang-ml.github.io/2020/06/04/detr/
date: 2020-06-04 18:19:09


DETR (DEtection TRansformer) is a recent work that has attracted a lot of attention. The paper is titled "End-to-end object detection with Transformers", and Facebook Research has submitted it to ECCV 2020.

Since there are already plenty of interpretations and reviews of DETR online, this post will not attempt a broad discussion; instead it focuses on analyzing two concepts:

  • Set prediction and Hungarian Loss
  • Permutation Invariance

Object detection set prediction loss

Unlike previous object-detection models, DETR infers a fixed-size set of $N$ predictions (which can be viewed as a sequence of length $N$): it outputs $N$ predicted bounding boxes with class confidences, $\hat{y}=\left(\hat{b},\hat{c}\right)$, where $N$ is much larger than the number of actual objects in the image.

Note: in the original paper, for notational convenience, the authors also assume the ground-truth set has size $N$ and pad it with $\varnothing$, denoting "no object".
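Concretely, the model's per-image output has a fixed leading dimension of $N$ regardless of how many objects the image contains. A minimal shape sketch (sizes follow the DETR repo's defaults; the batch size here is just an example):

import torch

N = 100            # number of object queries / predictions per image in DETR
num_classes = 91   # COCO categories; one extra logit is the no-object class

# Hypothetical model output for a batch of 2 images:
pred_logits = torch.randn(2, N, num_classes + 1)  # class confidences c_hat
pred_boxes = torch.rand(2, N, 4)                  # boxes b_hat, (cx, cy, w, h)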

Hungarian Loss

The Hungarian loss was proposed in this paper [1]; it marked the first use of bipartite matching and the Hungarian algorithm in a deep-learning detection task. Below, we explain how the Hungarian loss is used and what it means.

For background on the Hungarian algorithm, see bipartite matching and hungarian algorithm.

The Hungarian algorithm solves (weighted) maximum matching on a bipartite graph, and it is guaranteed to find an optimal solution. First, don't let the name mislead you into thinking that the Hungarian loss is used to optimize the matching itself, i.e., that the maximum matching is solved by backpropagating a loss. That is not the case. What actually happens is: after the Hungarian algorithm has computed the optimal matching, each ground-truth box is paired with its matched predicted box, and we then compute the loss over the matched GT/prediction pairs: a cross-entropy loss on the class confidences, and an $\mathcal{L}_{\mathrm{box}}(\cdot)$ loss on the corresponding bounding boxes.

In other words, the Hungarian loss means:

first match with the Hungarian algorithm, then compute an ordinary loss on the matched pairs.

So it is a two-step process:

  • Step 1: use the Hungarian method to find the optimal matching:

$$\hat{\sigma}=\underset{\sigma \in \mathfrak{S}_{N}}{\arg \min } \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\left(y_{i}, \hat{y}_{\sigma(i)}\right)$$

$$\mathcal{L}_{\mathrm{match}}\left(y_{i}, \hat{y}_{\sigma(i)}\right)=-\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \hat{p}_{\sigma(i)}\left(c_{i}\right)+\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\sigma(i)}\right)$$

  • Step 2: with the matching fixed, compute the loss function and backpropagate:

$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y})=\sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)+\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right)\right]$$

Sen Yang Note: notice that in the bipartite matching, the paper takes the ground truth as the reference, finding the best matching choice $\hat{y}_{\sigma(i)}$ for each $y_{i}$. The prediction is therefore indexed through the ground-truth index $i$ as $\sigma(i)$, denoting the prediction matched to that GT. Consequently, no matter how the $N$ elements of the prediction sequence are permuted, the Hungarian method always yields the same optimal matching; the advantage of the notation $\sigma(i)$ is precisely that it expresses this permutation invariance. This is mentioned in the paper.
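To make these two steps concrete, here is a minimal toy sketch (my own illustration under assumed shapes, not DETR's actual implementation): it builds the matching cost from class probabilities and an L1 box cost, solves the assignment with scipy, then computes the fixed-matching loss and backpropagates.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

# Toy setup (hypothetical sizes): 2 ground-truth objects, N = 4 predictions
gt_labels = torch.tensor([1, 3])                     # class ids of the GT objects
gt_boxes = torch.rand(2, 4)                          # GT boxes, random for illustration
pred_logits = torch.randn(4, 5, requires_grad=True)  # 4 queries, 5 classes
pred_boxes = torch.rand(4, 4, requires_grad=True)

# Step 1: build the matching cost L_match and solve the assignment
prob = pred_logits.softmax(-1)
cost_class = -prob[:, gt_labels]                     # -p_hat_{sigma(i)}(c_i)
cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # L1 cost stands in for L_box
cost = (cost_class + cost_bbox).detach()             # the matching needs no gradients
pred_idx, gt_idx = map(torch.as_tensor, linear_sum_assignment(cost.numpy()))

# Step 2: with the matching sigma_hat fixed, compute the loss and backpropagate
cls_loss = F.cross_entropy(pred_logits[pred_idx], gt_labels[gt_idx])
box_loss = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
(cls_loss + box_loss).backward()  # DETR also supervises unmatched queries as no-object

In the real DETR the matching cost additionally includes a GIoU term and the losses are weighted; this sketch only keeps the two-step structure.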

Everything described above can be found, expressed in code, in the DETR repository.

The docstring of SetCriterion, which computes the loss, reads:

class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """

The docstring and code of HungarianMatcher are:

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Modules to compute the matching cost and solve the corresponding LSAP.
"""
import torch
from scipy.optimize import linear_sum_assignment
from torch import nn

from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou


class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network
    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher
        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"

    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching
        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates
            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates
        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        tgt_ids = torch.cat([v["labels"] for v in targets])
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it in 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, it can be omitted.
        cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

        # Compute the giou cost between boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()

        sizes = [len(v["boxes"]) for v in targets]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]


def build_matcher(args):
    return HungarianMatcher(cost_class=args.set_cost_class, cost_bbox=args.set_cost_bbox, cost_giou=args.set_cost_giou)

The goal of this module is exactly the first step described above: solving the optimal matching with the Hungarian method.

Its return value is the final optimal matching, as pairs of index_i and index_j:

A list of size batch_size, containing tuples of (index_i, index_j) where:
    - index_i is the indices of the selected predictions (in order)
    - index_j is the indices of the corresponding selected targets (in order)
For each batch element, it holds:
    len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
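As a hypothetical usage sketch (not a verbatim excerpt from the repo), the matched index pairs can be consumed like this to gather aligned prediction/target pairs for one batch element:

# indices = matcher(outputs, targets); below, batch element 0:
pred_i, tgt_j = indices[0]
matched_pred = outputs["pred_boxes"][0][pred_i]  # [num_matched, 4]
matched_tgt = targets[0]["boxes"][tgt_j]         # [num_matched, 4]
# rows now correspond one-to-one, so pairwise losses apply directly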

Because this matching process takes no part in backpropagation (it is a forward-only computation), notice that in the forward pass:

@torch.no_grad()
def forward(self, outputs, targets):

the @torch.no_grad() decorator simply disables gradient tracking altogether.

One more note: scipy.optimize.linear_sum_assignment(cost_matrix) is an off-the-shelf API that solves the assignment problem with the Hungarian algorithm; DETR calls this library function directly to obtain the matching result.
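A minimal standalone demo of this API on a toy cost matrix:

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j]: cost of assigning prediction i to target j (toy numbers)
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
row_ind, col_ind = linear_sum_assignment(cost)
print(row_ind, col_ind)              # [0 1 2] [1 0 2]
print(cost[row_ind, col_ind].sum())  # minimal total cost: 1 + 2 + 2 = 5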

Permutation Invariance

Let's first go over the concept of permutation invariance. What does it mean?

A simple example: max-pooling is permutation invariant over a set:
$$\max\left\{1,3,4,2,6\right\}=\max\left\{4,6,3,2,1\right\}=6$$
No matter in what order the elements 1, 2, 3, 4, 6 appear in the sequence, the result of taking the max is always 6.
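The same fact can be checked in a couple of lines of PyTorch:

import torch

x = torch.tensor([1., 3., 4., 2., 6.])
perm = torch.randperm(x.numel())  # a random reordering of the elements
assert x.max() == x[perm].max()   # the max never changes: it is always 6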

Positional Encoding

DETR mentions that embeddings or positional encodings can be used to remove permutation invariance. Why is that?

Because positional embeddings alter permutation invariance, making the output permutation dependent (permutation variant).

For example, suppose we modify the max operation by multiplying each element by an encoding of its position index:

That is, take the positional encoding $p_1, p_2, \ldots, p_i = 1, 2, \ldots, i$:
$$\max\left\{1 \cdot p_1, 3 \cdot p_2, 4 \cdot p_3, 2 \cdot p_4, 6 \cdot p_5\right\}=30 \\ \max\left\{4 \cdot p_1, 6 \cdot p_2, 3 \cdot p_3, 2 \cdot p_4, 1 \cdot p_5\right\}=12$$

DETR flattens the image features from shape **[batch_size, feature_number, height, width]** to **[batch_size, height*width, feature_number]**, so the 2D spatial structure disappears. This loss of information is exactly why a positional-encoding strategy is needed to express the original, complete 2D structure, and the Transformer's position-encoding strategy satisfies this requirement.
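A minimal sketch of this flattening, plus a learned positional embedding added back (shapes here are illustrative; the DETR repo offers both a fixed sine encoding and a learned variant):

import torch
import torch.nn as nn

bs, c, h, w = 2, 256, 25, 34         # hypothetical backbone output shape
features = torch.randn(bs, c, h, w)

# [bs, C, H, W] -> [bs, H*W, C]: the 2D grid becomes a flat sequence
seq = features.flatten(2).permute(0, 2, 1)

# one learned embedding per spatial location re-injects
# the 2D position that flattening discarded
pos_embed = nn.Parameter(torch.randn(h * w, c))
seq = seq + pos_embed  # broadcasts over the batch dimension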

Sets and Sequences

Position encoding turns the set into a sequence: permutation invariance (permutation-invariance) disappears, and permutation dependence (permutation-variance) arises.

Afterword

Caring only about performance gains is a disaster for academia;

I keep wondering: if we combined all the performance-boosting methods and tricks proposed at top conferences from 2012 to 2020, could the mAP on COCO detection really be pushed to 100%? Put differently, are those tricks mutually independent, incremental steps forward?

First principles drive us to keep tracing a problem back to its essence, while the powerful force field of excellent existing methods keeps us living within the constraints of our predecessors' frameworks. In object detection over the past few years, one can clearly feel certain frameworks constraining the methods. DETR is an exercise in tracing back to the source; although it carries many familiar shadows, it is still full of vitality.

Original paper: detr

This post introduced the [Hungarian Loss] and [Permutation Invariance] parts; the [Transformer] part is left for the next installment.


  1. End-to-End People Detection in Crowded Scenes, CVPR 2016 ↩︎
