Detr & End-to-end object detection with Transformers (1)


title: Detr
author: yangsenius
original link: https://senyang-ml.github.io/2020/06/04/detr/
date: 2020-06-04 18:19:09


Detr (DEtection TRansformer) is a piece of work that has recently attracted a lot of attention. The paper is titled "End-to-End Object Detection with Transformers", and Facebook Research has submitted it to ECCV 2020.

Given how many interpretations and reviews of DETR are already available online, this post will not add yet another broad discussion; instead, it focuses on analyzing two concepts:

  • Set prediction and Hungarian Loss
  • Permutation Invariance

Object detection set prediction loss

Unlike previous object detection models, DETR infers a fixed-size set of $N$ predictions (which can be thought of as a sequence of length $N$), i.e., it outputs $N$ predicted bounding boxes and class confidences $\hat{y}=\left(\hat{b},\hat{c}\right)$, where $N$ is much larger than the actual number of objects in the image.

Note: in the paper, for notational convenience, the authors assume the ground-truth set also has size $N$ and pad it with $\varnothing$, denoting "no object".
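As a minimal sketch of this padding idea (the function pad_targets and the no-object index used here are hypothetical illustrations, not DETR's actual implementation, which handles this inside its loss):

```python
import torch

# Hypothetical sketch: pad a ground-truth label set to a fixed size N by
# filling the remainder with a special "no object" class id.
def pad_targets(labels: torch.Tensor, N: int, no_object_class: int) -> torch.Tensor:
    """labels: [num_gt] class ids; returns [N] ids padded with no_object_class."""
    padded = torch.full((N,), no_object_class, dtype=torch.int64)
    padded[: labels.numel()] = labels
    return padded

print(pad_targets(torch.tensor([3, 17]), N=5, no_object_class=91))
# tensor([ 3, 17, 91, 91, 91])
```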

Hungarian Loss

The Hungarian loss was proposed in this paper [1]; it marks the first time bipartite matching and the Hungarian algorithm were used in a deep-learning detection task. Below, we explain the usage and the meaning of the Hungarian loss.

For background on the Hungarian algorithm, see bipartite matching and hungarian algorithm.

The Hungarian algorithm solves the (weighted) maximum-matching problem on a bipartite graph, equivalently the minimum-cost assignment problem, and it is guaranteed to find an optimal solution. First, don't let the name mislead you into thinking that the Hungarian loss is used to optimize the matching itself, i.e., that the matching is solved by backpropagating through a loss function. That is not the case. What actually happens is: after the Hungarian algorithm has computed the optimal matching, each ground-truth box is paired with its matched predicted box, and we then compute the loss on these matched GT/prediction pairs: the cross-entropy loss on the class confidences and the $\mathcal{L}_{\mathrm{box}}(\cdot)$ loss on the corresponding boxes.

The so-called Hungarian loss is therefore:

first match with the Hungarian algorithm, then compute an ordinary loss on the matched pairs.

So it is a two-step process:

  • Step 1: solve the optimal matching with the Hungarian method

$$\hat{\sigma}=\underset{\sigma \in \mathfrak{S}_{N}}{\arg \min } \sum_{i}^{N} \mathcal{L}_{\mathrm{match}}\left(y_{i}, \hat{y}_{\sigma(i)}\right)$$

$$\mathcal{L}_{\mathrm{match}}\left(y_{i}, \hat{y}_{\sigma(i)}\right)=-\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \hat{p}_{\sigma(i)}\left(c_{i}\right)+\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\sigma(i)}\right)$$

  • Step 2: with the matching fixed, compute the loss and backpropagate its gradients

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y})=\sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)+\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right)\right]$$
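To make the two steps concrete, here is a toy sketch (the shapes, seed, and classification-only matching cost are assumptions for illustration; DETR's real cost also includes the box terms):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

# Toy two-step Hungarian loss: N predictions vs. M ground-truth objects,
# with a classification-only matching cost -p_hat(c_i) for illustration.
torch.manual_seed(0)
N, num_classes = 5, 4                      # last class plays the role of "no object"
logits = torch.randn(N, num_classes, requires_grad=True)
gt_labels = torch.tensor([0, 2])           # M = 2 real objects

# Step 1: solve the optimal matching on the cost matrix (no gradients here).
with torch.no_grad():
    prob = logits.softmax(-1)              # [N, num_classes]
    cost = -prob[:, gt_labels]             # [N, M] matching cost
    pred_idx, gt_idx = linear_sum_assignment(cost.numpy())
pred_idx, gt_idx = torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

# Step 2: with the matching fixed, compute an ordinary cross-entropy loss;
# unmatched predictions are supervised toward the "no object" class.
targets = torch.full((N,), num_classes - 1, dtype=torch.long)
targets[pred_idx] = gt_labels[gt_idx]
loss = F.cross_entropy(logits, targets)
loss.backward()                            # gradients flow only through step 2
```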

Sen Yang Note: notice that the bipartite matching takes the ground truth as the reference: for each $y_{i}$ we look for its best match $\hat{y}_{\sigma(i)}$. The prediction is therefore indexed through the ground-truth index $i$ as $\sigma(i)$, denoting the prediction matched to that ground truth. As a consequence, no matter in what permutation the $N$ elements of the predicted sequence arrive, the Hungarian method ends up with the same optimal matching; the benefit of writing the index as $\sigma(i)$ is that it expresses this permutation invariance. This is mentioned in the paper.
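A quick numerical check of this claim (a sketch assuming the optimum is unique, which holds for generic random costs): shuffling the prediction rows of the cost matrix relabels the indices but pairs every ground truth with the same underlying prediction.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.random((5, 3))          # 5 predictions x 3 ground truths
perm = rng.permutation(5)          # shuffle the predictions

rows, cols = linear_sum_assignment(cost)
rows_p, cols_p = linear_sum_assignment(cost[perm])

# Map the shuffled row indices back to the original prediction ids:
# the matched (prediction, ground-truth) pairs are identical.
assert set(zip(perm[rows_p], cols_p)) == set(zip(rows, cols))
```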

Everything discussed above can be found in the DETR code.

The docstring of SetCriterion, the loss-computation module, reads:

```python
class SetCriterion(nn.Module):
    """ This class computes the loss for DETR.
    The process happens in two steps:
        1) we compute hungarian assignment between ground truth boxes and the outputs of the model
        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
    """
    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses):
        """ Create the criterion.
        Parameters:
            num_classes: number of object categories, omitting the special no-object category
            matcher: module able to compute a matching between targets and proposals
            weight_dict: dict containing as key the names of the losses and as values their relative weight.
            eos_coef: relative classification weight applied to the no-object category
            losses: list of all the losses to be applied. See get_loss for list of available losses.
        """

The docstring and code of HungarianMatcher are:

```python
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Modules to compute the matching cost and solve the corresponding LSAP.
"""
import torch
from scipy.optimize import linear_sum_assignment
from torch import nn

from util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou


class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network
    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher
        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs can't be 0"

    @torch.no_grad()
    def forward(self, outputs, targets):
        """ Performs the matching
        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates
            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates
        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2]

        # We flatten to compute the cost matrices in a batch
        out_prob = outputs["pred_logits"].flatten(0, 1).softmax(-1)  # [batch_size * num_queries, num_classes]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4]

        # Also concat the target labels and boxes
        tgt_ids = torch.cat([v["labels"] for v in targets])
        tgt_bbox = torch.cat([v["boxes"] for v in targets])

        # Compute the classification cost. Contrary to the loss, we don't use the NLL,
        # but approximate it with 1 - proba[target class].
        # The 1 is a constant that doesn't change the matching, so it can be omitted.
        cost_class = -out_prob[:, tgt_ids]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

        # Compute the giou cost between boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
        C = C.view(bs, num_queries, -1).cpu()

        sizes = [len(v["boxes"]) for v in targets]
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]


def build_matcher(args):
    return HungarianMatcher(cost_class=args.set_cost_class, cost_bbox=args.set_cost_bbox, cost_giou=args.set_cost_giou)
```

This module performs exactly the first step described above: solving the optimal matching with the Hungarian method.

Its return value is the final optimal matching, given as pairs (index_i, index_j):

```
A list of size batch_size, containing tuples of (index_i, index_j) where:
    - index_i is the indices of the selected predictions (in order)
    - index_j is the indices of the corresponding selected targets (in order)
For each batch element, it holds:
    len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
```
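To consume these indices downstream, one can gather the matched pairs per image; this is a hypothetical sketch, loosely analogous to what SetCriterion does internally:

```python
import torch

# Hypothetical helper: gather matched predictions and targets for one image.
def gather_matched(pred_boxes, tgt_boxes, index_i, index_j):
    """pred_boxes: [num_queries, 4]; tgt_boxes: [num_target_boxes, 4]."""
    matched_pred = pred_boxes[index_i]   # predictions selected by the matcher
    matched_tgt = tgt_boxes[index_j]     # targets, aligned with matched_pred
    return matched_pred, matched_tgt     # ready for L1 / GIoU supervision
```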

Since this matching process does not participate in backpropagation and is a pure forward computation, note that the forward pass

```python
@torch.no_grad()
def forward(self, outputs, targets):
```

is wrapped with the @torch.no_grad() decorator, which disables gradient tracking entirely.

One more note: scipy.optimize.linear_sum_assignment(cost_matrix) is an API that solves the assignment problem with the Hungarian algorithm, and DETR calls this library directly to obtain the matching result.
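A minimal standalone example of this API (the cost matrix here is the one from the SciPy documentation): each row is assigned to a distinct column so that the total cost is minimized.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
row_ind, col_ind = linear_sum_assignment(cost)
print(row_ind, col_ind)              # [0 1 2] [1 0 2]
print(cost[row_ind, col_ind].sum())  # 5
```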

Permutation Invariance

Let us first go over the concept of permutation invariance. What does it mean?

A simple example: max-pooling is permutation-invariant over a set:

$$\max\left\{1,3,4,2,6\right\}=\max\left\{4,6,3,2,1\right\}=6$$

No matter in which order the elements 1, 2, 3, 4, 6 are arranged, taking the max always yields 6.
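The same fact in a few lines of PyTorch (a trivial sketch):

```python
import torch

x = torch.tensor([1., 3., 4., 2., 6.])
perm = torch.randperm(x.numel())     # any random reordering
assert x.max() == x[perm].max()      # the max is always 6
```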

Positional Encoding

DETR mentions that embeddings or positional encodings can be used to remove permutation invariance. Why is that?

Because positional embeddings break permutation invariance, making the operation permutation-variant (sensitive to the input order).

For example, suppose we inject position into the max operation by multiplying each element by its position index, i.e., with positional encodings $p_1, p_2, \ldots, p_i = 1, 2, \ldots, i$:

$$\max\left\{1 \cdot p_1, 3 \cdot p_2, 4 \cdot p_3, 2 \cdot p_4, 6 \cdot p_5\right\}=30\\ \max\left\{4 \cdot p_1, 6 \cdot p_2, 3 \cdot p_3, 2 \cdot p_4, 1 \cdot p_5\right\}=12$$
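Reproducing this toy positional encoding in code (a sketch of the example above, not anything DETR actually computes):

```python
import torch

p = torch.arange(1., 6.)                 # positions p_1..p_5 = 1..5
a = torch.tensor([1., 3., 4., 2., 6.])
b = torch.tensor([4., 6., 3., 2., 1.])   # a permutation of a
# After multiplying by position, the two orderings no longer agree:
print((a * p).max().item(), (b * p).max().item())  # 30.0 vs 12.0
```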

DETR flattens the image features from shape **[Batchsize, Feature number, height, width]** into **[Batchsize, height*width, Feature number]**, so the 2D spatial structure disappears. This loss of information is precisely why a positional-encoding strategy is needed to express the original, complete 2D structure, and the Transformer's positional encoding satisfies this requirement.
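A small sketch of this flattening (tensor names are illustrative; DETR's own code arranges the sequence dimension slightly differently):

```python
import torch

B, C, H, W = 2, 256, 32, 32
features = torch.randn(B, C, H, W)             # [Batchsize, Feature number, H, W]
tokens = features.flatten(2).transpose(1, 2)   # [Batchsize, H*W, Feature number]
print(tokens.shape)                            # torch.Size([2, 1024, 256])
```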

Sets and Sequences

Positional encoding turns the set into a sequence: permutation invariance disappears, and the result becomes dependent on the ordering (permutation-variant).

Afterword

Caring only about performance gains is a disaster for the research community;

I keep wondering: if we stacked together every performance-improving method and trick proposed at top venues from 2012 to 2020, would the mAP on COCO detection really reach 100%? Put differently, are all these tricks truly independent, incremental steps forward that never interfere with one another?

First principles push us to keep tracing a problem back to its essence, while the strong force field of excellent existing methods keeps us living inside the frameworks our predecessors built. Looking at object detection methods in recent years, one can clearly feel certain frameworks constraining them. DETR is an exercise in tracing the problem back to its source; although it carries many familiar shadows, it is full of vitality.

Original paper: detr

This post has introduced the Hungarian Loss and Permutation Invariance; the Transformer part will be covered in the next installment.


  1. End-to-End People Detection in Crowded Scenes. In CVPR, 2016.
