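1. What is object detection

In image classification, we assume that an image contains only one main object and focus on recognizing its class. Often, however, an image contains several objects of interest, and we want to know not only their classes but also their exact positions in the image. In computer vision, this kind of task is called object detection.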

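2. General approach to object detection

The basic approach can be roughly divided into two steps:

First, mark the objects in the target image with boxes.
Then feed each boxed region into a classifier to determine the class of the object.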

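2.1 Bounding boxes

A bounding box is a rectangle determined by the x and y coordinates of its top-left corner together with the x and y coordinates of its bottom-right corner. We define the bounding boxes of the dog and the cat in the image from this coordinate information. The origin is at the top-left corner of the image, and the positive x and y directions point right and down, respectively.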

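2.2 Reasonable bounding boxes?

Object detection algorithms usually sample a large number of regions in the input image, decide whether these regions contain objects of interest, and adjust the region edges to predict the ground-truth bounding box of each object more accurately. Different models may use different region sampling methods.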

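2.2.1 One way to generate anchor boxes

2.2.1.1 Method description

Generate multiple bounding boxes with different sizes and aspect ratios centered on each pixel. These boxes are called anchor boxes.
Suppose the input image has height $h$ and width $w$. We generate anchor boxes of different shapes centered on each pixel of the image. Let the size of an anchor box be $s \in (0, 1]$ and its aspect ratio (width to height) be $r$. The width and height of the anchor box are then $ws\sqrt{r}$ and $hs/\sqrt{r}$, respectively. Once the center position is fixed, an anchor box with a given size and aspect ratio is uniquely determined.

Suppose we choose a set of sizes $s_1,\ldots,s_n$ and a set of aspect ratios $r_1,\ldots,r_m$.

If we used every combination of size and aspect ratio at each pixel, the input image would yield $whnm$ anchor boxes in total. These anchor boxes might cover all ground-truth bounding boxes, but the computational cost easily becomes too high. So we usually consider only the combinations that contain $s_1$ or $r_1$ (~~I have not figured out why only these are of interest~~), namely

$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1).$

That is, the number of anchor boxes centered on the same pixel is $n+m-1$, and for the whole input image we generate $wh(n+m-1)$ anchor boxes.

Why $n+m-1$? There are $m$ combinations that contain $s_1$ and $n$ combinations that contain $r_1$, and $(s_1, r_1)$ is counted in both groups, so the total is $m+n-1$.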
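As a minimal sketch of this counting argument, the pure-Python snippet below enumerates the combinations that contain the first size or the first aspect ratio and computes the width and height of each anchor box as fractions of the image width and height. The helper name anchor_shapes is hypothetical and not part of d2lzh_pytorch; the sizes and ratios are the default values of the MultiBoxPrior function shown in this section.

```python
import math


def anchor_shapes(sizes, ratios):
    """Return (size, ratio, width, height) for every kept combination.

    Widths and heights are normalized: width = s * sqrt(r), height = s / sqrt(r).
    """
    # m combinations with sizes[0], plus n - 1 further combinations with ratios[0]
    pairs = [(sizes[0], r) for r in ratios] + [(s, ratios[0]) for s in sizes[1:]]
    return [(s, r, s * math.sqrt(r), s / math.sqrt(r)) for s, r in pairs]


sizes = [0.75, 0.5, 0.25]   # n = 3
ratios = [1, 2, 0.5]        # m = 3
shapes = anchor_shapes(sizes, ratios)
for s, r, width, height in shapes:
    print(f"s={s}, r={r}: width={width:.3f}, height={height:.3f}")

print(len(shapes))               # n + m - 1 = 5 anchor boxes per pixel
print(561 * 728 * len(shapes))   # 2042040 anchor boxes for a 561 x 728 input
```

The last line reproduces the anchor count reported for the 561 x 728 example in this section.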

2.2.1.2 Code

Purpose: given an input, a set of sizes, and a set of aspect ratios, this function returns all anchor boxes for the input.

import math

import numpy as np
import torch

# This function is saved in the d2lzh_pytorch package for later use
def MultiBoxPrior(feature_map, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5]):
    """
    # Implemented as described in "9.4.1. Generating multiple anchor boxes"; an anchor is represented as (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        feature_map: torch tensor, Shape: [N, C, H, W].
        sizes: List of sizes (0~1) of generated MultiBoxPriors.
        ratios: List of aspect ratios (positive) of generated MultiBoxPriors.
    Returns:
        anchors of shape (1, num_anchors, 4). Every item in the batch has the same anchors, so the first dimension is 1.
    """
    pairs = [] # pairs of (size, sqrt(ratio))
    for r in ratios:
        pairs.append([sizes[0], math.sqrt(r)])
    for s in sizes[1:]:
        pairs.append([s, math.sqrt(ratios[0])])
    
    pairs = np.array(pairs)
    
    ss1 = pairs[:, 0] * pairs[:, 1] # size * sqrt(ratio)
    ss2 = pairs[:, 0] / pairs[:, 1] # size / sqrt(ratio)
    
    base_anchors = np.stack([-ss1, -ss2, ss1, ss2], axis=1) / 2
    
    h, w = feature_map.shape[-2:]
    shifts_x = np.arange(0, w) / w
    shifts_y = np.arange(0, h) / h
    shift_x, shift_y = np.meshgrid(shifts_x, shifts_y)
    shift_x = shift_x.reshape(-1)
    shift_y = shift_y.reshape(-1)
    shifts = np.stack((shift_x, shift_y, shift_x, shift_y), axis=1)
    
    anchors = shifts.reshape((-1, 1, 4)) + base_anchors.reshape((1, -1, 4))
    
    return torch.tensor(anchors, dtype=torch.float32).view(1, -1, 4)

h, w = 561, 728
X = torch.Tensor(1, 3, h, w)  # construct the input data
Y = MultiBoxPrior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
'''
Y.shape: torch.Size([1, 2042040, 4])
'''
boxes = Y.reshape((h, w, 5, 4))
'''
boxes.shape: torch.Size([561, 728, 5, 4])
'''

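Example:

X: (1, 3, 561, 728): an image with 3 channels and $561 \times 728$ pixels.
sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5]: a set of sizes (0.75, 0.5, 0.25) and a set of aspect ratios (1, 2, 0.5).

Calling MultiBoxPrior generates anchor boxes centered on each of the $561 \times 728$ pixels. Each pixel has $3+3-1=5$ anchor boxes, and each anchor box is described by $4$ coordinates.

The returned anchor variable Y has shape (batch size, number of anchor boxes, 4). After reshaping Y to (image height, image width, number of anchor boxes centered on the same pixel, 4), we can obtain all anchor boxes centered on a given pixel by specifying the pixel position.

The code above comes from the PyTorch version of Dive into Deep Learning (《動手學深度學習》).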

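2.2.2 Evaluating the quality of anchor boxes

2.2.2.1 Basic concepts

Some anchor boxes cover an object better than others. If the ground-truth bounding box of the object is known, how can "better" be quantified? An intuitive approach is to measure the similarity between the anchor box and the ground-truth bounding box. The Jaccard index measures the similarity of two sets. Given sets $\mathcal{A}$ and $\mathcal{B}$, their Jaccard index is the size of their intersection divided by the size of their union:
$J(\mathcal{A},\mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left| \mathcal{A} \cup \mathcal{B}\right|}.$
In fact, we can regard the pixel region inside a bounding box as a set of pixels, so the similarity of two bounding boxes can be measured by the Jaccard index of their pixel sets.
For bounding boxes, the Jaccard index is usually called intersection over union (IoU): the ratio of the area of the intersection of the two boxes to the area of their union. IoU ranges from 0 to 1: 0 means that the two bounding boxes share no pixels, and 1 means that they are identical.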
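To make the definition concrete, here is a minimal pure-Python sketch of IoU for two boxes in (xmin, ymin, xmax, ymax) form. The function name iou and the toy boxes are illustrative assumptions; the snippet is only a scalar counterpart of the idea, not the vectorized implementation used in this article.

```python
def iou(box_a, box_b):
    """IoU of two boxes, each given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the overlap; clamped at 0 when the boxes are disjoint
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # Union = sum of the areas minus the part counted twice
    return inter / (area_a + area_b - inter)


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes: 0.0
print(iou((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes: 1.0
```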

2.2.2.2 Code

Code source: code link

# The following functions are saved in the d2lzh_pytorch package for later use
def compute_intersection(set_1, set_2):
    """
    Compute the intersection areas between anchors
    Args:
        set_1: a tensor of dimensions (n1, 4), anchors represented as (xmin, ymin, xmax, ymax)
        set_2: a tensor of dimensions (n2, 4), anchors represented as (xmin, ymin, xmax, ymax)
    Returns:
        intersection of each of the boxes in set 1 with respect to each of the boxes in set 2, shape: (n1, n2)
    """
    # PyTorch auto-broadcasts singleton dimensions
    lower_bounds = torch.max(set_1[:, :2].unsqueeze(1), set_2[:, :2].unsqueeze(0))  # (n1, n2, 2)
    upper_bounds = torch.min(set_1[:, 2:].unsqueeze(1), set_2[:, 2:].unsqueeze(0))  # (n1, n2, 2)
    intersection_dims = torch.clamp(upper_bounds - lower_bounds, min=0)  # (n1, n2, 2)
    return intersection_dims[:, :, 0] * intersection_dims[:, :, 1]  # (n1, n2)


def compute_jaccard(set_1, set_2):
    """
    Compute the Jaccard index (IoU) between anchors
    Args:
        set_1: a tensor of dimensions (n1, 4), anchors represented as (xmin, ymin, xmax, ymax)
        set_2: a tensor of dimensions (n2, 4), anchors represented as (xmin, ymin, xmax, ymax)
    Returns:
        Jaccard Overlap of each of the boxes in set 1 with respect to each of the boxes in set 2, shape: (n1, n2)
    """
    # Find intersections
    intersection = compute_intersection(set_1, set_2)  # (n1, n2)

    # Find areas of each box in both sets
    areas_set_1 = (set_1[:, 2] - set_1[:, 0]) * (set_1[:, 3] - set_1[:, 1])  # (n1)
    areas_set_2 = (set_2[:, 2] - set_2[:, 0]) * (set_2[:, 3] - set_2[:, 1])  # (n2)

    # Find the union
    # PyTorch auto-broadcasts singleton dimensions
    union = areas_set_1.unsqueeze(1) + areas_set_2.unsqueeze(0) - intersection  # (n1, n2)

    return intersection / union  # (n1, n2)

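2.2.3 How to train with anchor boxes

2.2.3.1 Basic concepts

In the training set, we treat each anchor box as a training example. Each anchor box needs two kinds of labels: the class of the object it contains (the class), and the offset of the ground-truth bounding box relative to the anchor box (the offset).

The general idea at detection time is:

First generate multiple anchor boxes.
Predict a class and an offset for each anchor box.
Adjust the position of each anchor box according to the predicted offset to obtain a predicted bounding box.
Filter the predicted bounding boxes that need to be output.

In an object detection training set, each image is annotated with the positions of the ground-truth bounding boxes and the classes of the objects they contain. After generating anchor boxes, we label each anchor box mainly from the position and class of the ground-truth bounding box that is similar to it. So how do we assign similar ground-truth bounding boxes to anchor boxes?

The solution is to store in a matrix the IoU between every generated anchor box (rows) and every ground-truth bounding box (columns). Each time, pick the largest IoU in the matrix, assign that ground-truth box to that anchor box, and discard the row and column of that maximum. Repeating this $n$ times, where $n$ is the number of ground-truth bounding boxes, gives each ground-truth box its best anchor box. Each remaining anchor box is then assigned the ground-truth box with which it has the largest IoU, but only if that IoU reaches a preset threshold.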
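The matrix procedure can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the d2lzh_pytorch implementation: the function name assign_anchors and the IoU values in iou_matrix are hypothetical, and the sketch follows the description above by always taking the global maximum of the remaining matrix. Anchors left unassigned after that step receive a ground-truth box only if their best IoU reaches the threshold.

```python
def assign_anchors(iou, threshold=0.5):
    """iou[i][j] is the IoU of anchor i with ground-truth box j.

    Returns one ground-truth index per anchor, or -1 for background.
    """
    n_anchors, n_gt = len(iou), len(iou[0])
    assigned = [-1] * n_anchors
    free_rows = list(range(n_anchors))
    free_cols = list(range(n_gt))

    # Step 1: take the largest remaining entry, then drop its row and column
    while free_rows and free_cols:
        i, j = max(((r, c) for r in free_rows for c in free_cols),
                   key=lambda rc: iou[rc[0]][rc[1]])
        assigned[i] = j
        free_rows.remove(i)
        free_cols.remove(j)

    # Step 2: each remaining anchor takes its best box if the IoU is high enough
    for i in range(n_anchors):
        if assigned[i] == -1:
            j = max(range(n_gt), key=lambda c: iou[i][c])
            if iou[i][j] >= threshold:
                assigned[i] = j
    return assigned


# Hypothetical IoU matrix: 4 anchors (rows) x 2 ground-truth boxes (columns)
iou_matrix = [
    [0.10, 0.00],
    [0.60, 0.30],
    [0.20, 0.80],
    [0.55, 0.05],
]
print(assign_anchors(iou_matrix))  # [-1, 0, 1, 0]
```

Anchor 2 wins box 1 (0.80), anchor 1 then wins box 0 (0.60), anchor 3 is added to box 0 because 0.55 reaches the threshold, and anchor 0 stays background.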

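Once the assignment is determined, the class and offset of each anchor box can be labeled.

For example, if anchor box $A$ is assigned ground-truth bounding box $B$, the class of $A$ is set to the class of $B$, and the offset of $A$ is labeled from the relative position of the centers of $B$ and $A$ and from the relative sizes of the two boxes.

Because the boxes in a dataset vary in position and size, these relative positions and sizes usually need a special transformation so that the offsets are distributed more uniformly and are easier to fit. Let the center coordinates of anchor box $A$ and its assigned ground-truth bounding box $B$ be $(x_a,y_a)$ and $(x_b,y_b)$, their widths $w_a$ and $w_b$, and their heights $h_a$ and $h_b$. A common technique is to label the offset of $A$ as
$\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x}, \frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y}, \frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w}, \frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h}\right)$
where the constants default to $\mu_x = \mu_y = \mu_w = \mu_h = 0, \sigma_x=\sigma_y=0.1, \sigma_w=\sigma_h=0.2$. If an anchor box is not assigned a ground-truth bounding box, we simply set its class to background. Anchor boxes whose class is background are usually called negative anchor boxes, and the rest are called positive anchor boxes.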
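The offset formula can be evaluated directly for a single pair of boxes. The sketch below is a pure-Python illustration with hypothetical helper names (xyxy_to_cxcywh, offset_label). The two boxes are the anchor $A_4$ and the cat's ground-truth box from the example in this section, both given as (xmin, ymin, xmax, ymax).

```python
import math


def xyxy_to_cxcywh(box):
    """Convert (xmin, ymin, xmax, ymax) to (center_x, center_y, width, height)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2, xmax - xmin, ymax - ymin)


def offset_label(anchor, gt, mu=(0.0, 0.0, 0.0, 0.0), sigma=(0.1, 0.1, 0.2, 0.2)):
    """Offset of ground-truth box gt relative to anchor, normalized by mu and sigma."""
    xa, ya, wa, ha = xyxy_to_cxcywh(anchor)
    xb, yb, wb, hb = xyxy_to_cxcywh(gt)
    raw = ((xb - xa) / wa, (yb - ya) / ha, math.log(wb / wa), math.log(hb / ha))
    return tuple((v - m) / s for v, m, s in zip(raw, mu, sigma))


anchor_a4 = (0.57, 0.3, 0.92, 0.9)
cat_box = (0.55, 0.2, 0.9, 0.88)
print(offset_label(anchor_a4, cat_box))
print(offset_label(anchor_a4, anchor_a4))  # a box relative to itself: all zeros
```

With the default constants, dividing by $\sigma_x = 0.1$ is the same as multiplying by 10, and dividing by $\sigma_w = 0.2$ is the same as multiplying by 5.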

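2.2.3.2 Code implementation

# The following functions are saved in the d2lzh_pytorch package for later use
def assign_anchor(bb, anchor, jaccard_threshold=0.5):
    """
    # Assign a ground-truth bb to each anchor as described in Figure 9.3 of "9.4.1. Generating multiple anchor boxes"; anchors are normalized (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        bb: ground-truth bounding boxes, shape: (nb, 4)
        anchor: anchors to be assigned, shape: (na, 4)
        jaccard_threshold: preset threshold
    Returns:
        assigned_idx: shape: (na, ), index of the ground-truth bb assigned to each anchor, or -1 if no bb is assigned
    """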
    na = anchor.shape[0]
    nb = bb.shape[0]
    jaccard = compute_jaccard(anchor, bb).detach().cpu().numpy() # shape: (na, nb)
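    assigned_idx = np.ones(na) * -1  # initialize all entries to -1
    
    # First assign one anchor to each bb (jaccard_threshold is not required here)
    jaccard_cp = jaccard.copy()
    for j in range(nb):
        i = np.argmax(jaccard_cp[:, j])
        assigned_idx[i] = j
        jaccard_cp[i, :] = float("-inf") # set to negative infinity, which effectively removes this row
     
    # Handle the anchors that are still unassigned; these must satisfy jaccard_threshold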
    for i in range(na):
        if assigned_idx[i] == -1:
            j = np.argmax(jaccard[i, :])
            if jaccard[i, j] >= jaccard_threshold:
                assigned_idx[i] = j
    
    return torch.tensor(assigned_idx, dtype=torch.long)


def xy_to_cxcy(xy):
    """
    Convert anchors from (x_min, y_min, x_max, y_max) form to (center_x, center_y, w, h) form.
    https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/utils.py
    Args:
        xy: bounding boxes in boundary coordinates, a tensor of size (n_boxes, 4)
    Returns: 
        bounding boxes in center-size coordinates, a tensor of size (n_boxes, 4)
    """
    return torch.cat([(xy[:, 2:] + xy[:, :2]) / 2,  # c_x, c_y
                      xy[:, 2:] - xy[:, :2]], 1)  # w, h

def MultiBoxTarget(anchor, label):
    """
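    # Implemented as described in "9.4.1. Generating multiple anchor boxes"; anchors are normalized (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        anchor: torch tensor, the input anchor boxes, usually generated by MultiBoxPrior, shape: (1, total number of anchors, 4)
        label: ground-truth labels, shape: (bn, maximum number of ground-truth boxes per image, 5)
               In the second dimension, if an image has fewer ground-truth boxes, pad the blanks with -1; the elements of the last dimension are [class label, four coordinates]
    Returns:
        list, [bbox_offset, bbox_mask, cls_labels]
        bbox_offset: labeled offsets of each anchor box, shape: (bn, total number of anchors * 4)
        bbox_mask: same shape as bbox_offset, the mask of each anchor box, matching the offsets above one to one; the mask is 0 for negative (background) anchor boxes and 1 for positive anchor boxes
        cls_labels: labeled class of each anchor box, where 0 means background, shape: (bn, total number of anchors)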
    """
    assert len(anchor.shape) == 3 and len(label.shape) == 3
    bn = label.shape[0]
    
    def MultiBoxTarget_one(anc, lab, eps=1e-6):
        """
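        Helper of MultiBoxTarget; processes one item of the batch
        Args:
            anc: shape of (total number of anchors, 4)
            lab: shape of (number of ground-truth boxes, 5), where 5 stands for [class label, four coordinates]
            eps: a tiny value that prevents log(0)
        Returns:
            offset: (total number of anchors * 4, )
            bbox_mask: (total number of anchors * 4, ), 0 for background, 1 for non-background
            cls_labels: (total number of anchors, ), 0 for background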
        """
        an = anc.shape[0]
        assigned_idx = assign_anchor(lab[:, 1:], anc) # (total number of anchors, )
        bbox_mask = ((assigned_idx >= 0).float().unsqueeze(-1)).repeat(1, 4) # (total number of anchors, 4)

        cls_labels = torch.zeros(an, dtype=torch.long) # 0 means background
        assigned_bb = torch.zeros((an, 4), dtype=torch.float32) # bb coordinates assigned to every anchor
        for i in range(an):
            bb_idx = assigned_idx[i]
            if bb_idx >= 0: # i.e. not background
                cls_labels[i] = lab[bb_idx, 0].long().item() + 1 # note the +1, because 0 is reserved for background
                assigned_bb[i, :] = lab[bb_idx, 1:]

        center_anc = xy_to_cxcy(anc) # (center_x, center_y, w, h)
        center_assigned_bb = xy_to_cxcy(assigned_bb)

        offset_xy = 10.0 * (center_assigned_bb[:, :2] - center_anc[:, :2]) / center_anc[:, 2:]
        offset_wh = 5.0 * torch.log(eps + center_assigned_bb[:, 2:] / center_anc[:, 2:])
        offset = torch.cat([offset_xy, offset_wh], dim = 1) * bbox_mask # (錨框總數, 4)

        return offset.view(-1), bbox_mask.view(-1), cls_labels
    
    batch_offset = []
    batch_mask = []
    batch_cls_labels = []
    for b in range(bn):
        offset, bbox_mask, cls_labels = MultiBoxTarget_one(anchor[0, :, :], label[b, :, :])
        
        batch_offset.append(offset)
        batch_mask.append(bbox_mask)
        batch_cls_labels.append(cls_labels)
    
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    cls_labels = torch.stack(batch_cls_labels)
    
    return [bbox_offset, bbox_mask, cls_labels]

# ground_truth: the ground-truth bounding boxes, one row per box: [class, xmin, ymin, xmax, ymax]
ground_truth = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                            [1, 0.55, 0.2, 0.9, 0.88]])
# anchors: the generated anchor boxes
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                    [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                    [0.57, 0.3, 0.92, 0.9]])
labels = MultiBoxTarget(anchors.unsqueeze(dim=0), ground_truth.unsqueeze(dim=0))

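The MultiBoxTarget function labels the class and offset of each anchor box. It sets the background class to 0 and increments the zero-based integer index of each object class by 1 (1 for dog, 2 for cat). We use unsqueeze to add a batch dimension to the anchor boxes and to the ground-truth bounding boxes.
The inputs of MultiBoxTarget are:

anchor: the anchor boxes, with shape (1, number of anchor boxes, 4).
label: the ground-truth labels, with shape (batch size, maximum number of ground-truth boxes per image, 5), where each row is [class label, four coordinates].

The returned results are:

The four offsets labeled for each anchor box, where the offsets of negative anchor boxes are labeled 0.
The mask variable, which has the same shape as the offsets and is 0 for negative anchor boxes and 1 for positive anchor boxes.
The class labeled for each anchor box.

The code above demonstrates a concrete example. We define ground-truth bounding boxes for the cat and the dog in the image, where the first element is the class (0 for dog, 1 for cat) and the remaining 4 elements are the $x$ and $y$ coordinates of the top-left corner and of the bottom-right corner (values between 0 and 1). We also construct 5 anchor boxes to be labeled from their top-left and bottom-right coordinates, denoted $A_0, \ldots, A_4$ (indices start from 0 in the program).

We analyze the labeled classes based on the positions of the anchor boxes and the ground-truth bounding boxes in the image.

Among all pairs of an anchor box and a ground-truth bounding box, anchor box $A_4$ and the cat's ground-truth bounding box have the largest IoU, so $A_4$ is labeled as cat.
Setting aside $A_4$ and the cat's ground-truth bounding box, the pair with the largest IoU among the remaining pairs is $A_1$ and the dog's ground-truth bounding box, so $A_1$ is labeled as dog.
Next we go through the 3 remaining unlabeled anchor boxes. The ground-truth bounding box with the largest IoU with $A_0$ is the dog's, but the IoU is below the threshold (0.5 by default), so $A_0$ is labeled as background.
The ground-truth bounding box with the largest IoU with $A_2$ is the cat's, and the IoU is above the threshold, so $A_2$ is labeled as cat.
The ground-truth bounding box with the largest IoU with $A_3$ is the cat's, but the IoU is below the threshold, so $A_3$ is labeled as background.

The class labels of the five anchor boxes are therefore [0, 1, 2, 0, 2].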
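The IoU values behind this analysis can be checked with a few lines of pure Python. This is only a sketch: the iou helper is a hypothetical scalar counterpart of compute_jaccard, and the boxes are copied from the ground_truth and anchors tensors of the example.

```python
def iou(box_a, box_b):
    """IoU of two boxes, each given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


dog_box = (0.1, 0.08, 0.52, 0.92)
cat_box = (0.55, 0.2, 0.9, 0.88)
anchor_boxes = [(0, 0.1, 0.2, 0.3), (0.15, 0.2, 0.4, 0.4),
                (0.63, 0.05, 0.88, 0.98), (0.66, 0.45, 0.8, 0.8),
                (0.57, 0.3, 0.92, 0.9)]

# One row per anchor A_0..A_4, columns: IoU with the dog box, IoU with the cat box
table = [[iou(a, dog_box), iou(a, cat_box)] for a in anchor_boxes]
for i, (with_dog, with_cat) in enumerate(table):
    print(f"A{i}: dog={with_dog:.2f}, cat={with_cat:.2f}")
```

The largest entry is the pair of $A_4$ and the cat box (about 0.75). $A_1$ has the largest IoU with the dog box, and of the remaining anchors only $A_2$ reaches the 0.5 threshold.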

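2.2.4 Outputting predicted bounding boxes

2.2.4.1 Basic concepts

In the prediction phase, we first generate multiple anchor boxes for the image and predict a class and an offset for each of them. We then obtain predicted bounding boxes from the anchor boxes and their predicted offsets. When there are many anchor boxes, many similar predicted bounding boxes may be output for the same object. To make the result more concise, we can remove similar predicted bounding boxes. A common method is non-maximum suppression (NMS).

How non-maximum suppression works

In plain terms: each box has a probability for its predicted class. Pick the box with the highest probability, then discard every other box whose IoU with it exceeds a threshold. Repeat with the boxes that are left. The boxes that survive are predicted bounding boxes without duplicates.

For a predicted bounding box $B$, the model computes a predicted probability for each class. Let the largest predicted probability be $p$. The class corresponding to $p$ is the predicted class of $B$, and $p$ is called the confidence of $B$.

On the same image, we sort the predicted bounding boxes whose predicted class is not background by confidence from high to low to obtain a list $L$. Select from $L$ the predicted bounding box $B_1$ with the highest confidence as the baseline, and remove from $L$ all non-baseline predicted bounding boxes whose IoU with $B_1$ is greater than some threshold. The threshold is a preset hyperparameter. At this point, $L$ keeps the predicted bounding box with the highest confidence and no longer contains the boxes that are similar to it.

Next, select from $L$ the predicted bounding box $B_2$ with the second-highest confidence as the baseline, and remove from $L$ all non-baseline predicted bounding boxes whose IoU with $B_2$ is greater than the threshold. Repeat this process until every predicted bounding box in $L$ has served as the baseline. At this point, the IoU of any pair of predicted bounding boxes in $L$ does not exceed the threshold. Finally, output all predicted bounding boxes in $L$.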
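The procedure can be sketched in pure Python as below. The names iou and nms are hypothetical, and the sketch ignores the predicted class when comparing boxes. The four boxes and their confidences are taken from the example in this section (the largest class probability of each box).

```python
def iou(box_a, box_b):
    """IoU of two boxes, each given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, threshold=0.5):
    """Return the indices of the kept boxes, highest confidence first."""
    # The list L: box indices sorted by confidence, from high to low
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # current baseline box
        keep.append(best)
        # Drop every remaining box that overlaps the baseline too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep


boxes = [(0.1, 0.08, 0.52, 0.92), (0.08, 0.2, 0.56, 0.95),
         (0.15, 0.3, 0.62, 0.91), (0.55, 0.2, 0.9, 0.88)]
scores = [0.9, 0.8, 0.7, 0.9]
print(nms(boxes, scores))  # [0, 3]
```

Boxes 1 and 2 overlap box 0 with IoU of roughly 0.74 and 0.55, so both are removed. Box 3 does not overlap box 0 at all and is kept.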

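2.2.4.2 Code implementation

# The following functions are saved in the d2lzh_pytorch package for later use
from collections import namedtuple
Pred_BB_Info = namedtuple("Pred_BB_Info", ["index", "class_id", "confidence", "xyxy"])

def non_max_suppression(bb_info_list, nms_threshold = 0.5):
    """
    Apply non-maximum suppression to the predicted bounding boxes
    Args:
        bb_info_list: list of Pred_BB_Info, containing predicted class, confidence, and other information
        nms_threshold: threshold
    Returns:
        output: list of Pred_BB_Info, containing only the bounding boxes kept after filtering
    """
    output = []
    # First sort by confidence from high to low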
    sorted_bb_info_list = sorted(bb_info_list, key = lambda x: x.confidence, reverse=True)

    while len(sorted_bb_info_list) != 0:
        best = sorted_bb_info_list.pop(0)
        output.append(best)
        
        if len(sorted_bb_info_list) == 0:
            break

        bb_xyxy = []
        for bb in sorted_bb_info_list:
            bb_xyxy.append(bb.xyxy)
        
        iou = compute_jaccard(torch.tensor([best.xyxy]), 
                              torch.tensor(bb_xyxy))[0] # shape: (len(sorted_bb_info_list), )
        
        n = len(sorted_bb_info_list)
        sorted_bb_info_list = [sorted_bb_info_list[i] for i in range(n) if iou[i] <= nms_threshold]
    return output

def MultiBoxDetection(cls_prob, loc_pred, anchor, nms_threshold = 0.5):
    """
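    # Implemented as described in "9.4.1. Generating multiple anchor boxes"; anchors are normalized (xmin, ymin, xmax, ymax).
    https://zh.d2l.ai/chapter_computer-vision/anchor.html
    Args:
        cls_prob: predicted probabilities of each anchor box after softmax, shape: (bn, number of predicted classes + 1, number of anchors)
        loc_pred: predicted offsets of each anchor box, shape: (bn, number of anchors * 4)
        anchor: default anchor boxes output by MultiBoxPrior, shape: (1, number of anchors, 4)
        nms_threshold: threshold used in non-maximum suppression
    Returns:
        information on all anchor boxes, shape: (bn, number of anchors, 6)
        each anchor box is described by [class_id, confidence, xmin, ymin, xmax, ymax]
        class_id=-1 means background or removed by non-maximum suppression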
    """
    assert len(cls_prob.shape) == 3 and len(loc_pred.shape) == 2 and len(anchor.shape) == 3
    bn = cls_prob.shape[0]
    
    def MultiBoxDetection_one(c_p, l_p, anc, nms_threshold = 0.5):
        """
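        Helper of MultiBoxDetection; processes one item of the batch
        Args:
            c_p: (number of predicted classes + 1, number of anchors)
            l_p: (number of anchors * 4, )
            anc: (number of anchors, 4)
            nms_threshold: threshold used in non-maximum suppression
        Return:
            output: (number of anchors, 6)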
        """
        pred_bb_num = c_p.shape[1]
        anc = (anc + l_p.view(pred_bb_num, 4)).detach().cpu().numpy() # add the predicted offsets directly to the anchor coordinates
        
        confidence, class_id = torch.max(c_p, 0)
        confidence = confidence.detach().cpu().numpy()
        class_id = class_id.detach().cpu().numpy()
        
        pred_bb_info = [Pred_BB_Info(
                            index = i,
                            class_id = class_id[i] - 1, # positive class labels start from 0
                            confidence = confidence[i],
                            xyxy=[*anc[i]]) # xyxy is a list
                        for i in range(pred_bb_num)]
        
        # indices of the boxes kept by non-maximum suppression
        obj_bb_idx = [bb.index for bb in non_max_suppression(pred_bb_info, nms_threshold)]
        
        output = []
        for bb in pred_bb_info:
            output.append([
                (bb.class_id if bb.index in obj_bb_idx else -1.0),
                bb.confidence,
                *bb.xyxy
            ])
            
        return torch.tensor(output) # shape: (number of anchors, 6)
    
    batch_output = []
    for b in range(bn):
        batch_output.append(MultiBoxDetection_one(cls_prob[b], loc_pred[b], anchor[0], nms_threshold))
    
    return torch.stack(batch_output)

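# First construct 4 anchor boxes. For simplicity, assume the predicted offsets are all 0, so the predicted bounding boxes are the anchor boxes. Finally, construct the predicted probability of each class.
anchors = torch.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                        [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = torch.tensor([0.0] * anchors.numel())
cls_probs = torch.tensor([[0.0] * 4,  # predicted probability of background
                          [0.9, 0.8, 0.7, 0.1],  # predicted probability of dog
                          [0.1, 0.2, 0.3, 0.9]])  # predicted probability of cat
output = MultiBoxDetection(
    cls_probs.unsqueeze(dim=0), offset_preds.unsqueeze(dim=0),
    anchors.unsqueeze(dim=0), nms_threshold=0.5)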

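First construct 4 anchor boxes. Assume that

the predicted offsets offset_preds are all 0,
so the predicted bounding boxes are the anchor boxes anchors,
and the predicted probabilities of each class are cls_probs.

The MultiBoxDetection function performs non-maximum suppression with the threshold set to 0.5. A batch dimension is added to every input. The returned result has shape (batch size, number of anchor boxes, 6), where the 6 elements of each row describe one predicted bounding box.

The first element is the predicted class with a zero-based index (0 for dog, 1 for cat), where -1 means background or removed by non-maximum suppression.
The second element is the confidence of the predicted bounding box.
The remaining 4 elements are the $x$ and $y$ coordinates of the top-left corner and of the bottom-right corner of the predicted bounding box (values between 0 and 1).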

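3. Multiscale object detection

3.1 Detecting small objects

To make the display easier to read, anchor boxes with different centers are kept from overlapping: the anchor box size is set to 0.15 and the height and width of the feature map are both 4. The anchor box centers form 4 rows and 4 columns spread uniformly over the image.

3.2 Detecting medium-sized objects

We halve the height and width of the feature map and use larger anchor boxes to detect larger objects. With the anchor box size set to 0.4, some anchor box regions overlap.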
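These two settings can be checked numerically. The sketch below assumes the same aspect ratios as the MultiBoxPrior defaults (1, 2, 0.5), which the text does not state for this experiment, and uses normalized widths $s\sqrt{r}$ as in MultiBoxPrior. The helper anchor_stats is hypothetical. It counts the anchor boxes and tests whether the widest anchor box is wider than the horizontal spacing between neighboring centers.

```python
import math


def anchor_stats(fmap_h, fmap_w, size, ratios=(1, 2, 0.5)):
    """Return (total anchors, widest normalized width, neighbors overlap?)."""
    per_position = 1 + len(ratios) - 1        # n + m - 1 with a single size
    total = fmap_h * fmap_w * per_position
    max_width = max(size * math.sqrt(r) for r in ratios)
    spacing = 1.0 / fmap_w                    # distance between adjacent centers
    return total, max_width, max_width > spacing


print(anchor_stats(4, 4, 0.15))  # 4 x 4 feature map, small anchors
print(anchor_stats(2, 2, 0.4))   # 2 x 2 feature map, larger anchors
```

Under these assumptions the 4 x 4 feature map gives 48 anchor boxes whose widest box (about 0.21) is narrower than the 0.25 spacing, so boxes with different centers do not overlap. The 2 x 2 feature map gives 12 anchor boxes whose widest box (about 0.57) exceeds the 0.5 spacing, so some of them overlap.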

3.3 Detecting large objects

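4. Reference links

Dive into Deep Learning (《動手學深度學習》): Object Detection and Bounding Boxes

Computer Vision (1): Object Detection and Bounding Boxes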
