Associative Embedding: End-to-End Learning for Joint Detection and Grouping (Paper Translation)

Abstract:

 

We introduce associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping.

A number of computer vision problems can be framed in this manner including multi-person pose estimation, instance segmentation, and multi-object tracking.

This technique can be easily integrated into any state-of-the-art network architecture that produces pixel-wise predictions.

We show how to apply this method to both multi-person pose estimation and instance segmentation, and report state-of-the-art performance for multi-person pose on the MPII and MS-COCO datasets.

 

1. Introduction

Many computer vision tasks can be viewed as joint detection and grouping: detecting smaller visual units and grouping them into larger structures. For example, multi-person pose estimation can be viewed as detecting body joints and grouping them into individual people; instance segmentation can be viewed as detecting relevant pixels and grouping them into object instances; multi-object tracking can be viewed as detecting object instances and grouping them into tracks. In all of these cases, the output is a variable number of visual units and their assignment into a variable number of visual groups.

Such tasks are often approached with two-stage pipelines that perform detection first and grouping second. But such approaches may be suboptimal because detection and grouping are usually tightly coupled: for example, in multiperson pose estimation, a wrist detection is likely a false positive if there is not an elbow detection nearby to group with.

In this paper we ask whether it is possible to jointly perform detection and grouping using a single-stage deep network trained end-to-end. We propose associative embedding, a novel method to represent the output of joint detection and grouping. The basic idea is to introduce, for each detection, a real number that serves as a “tag” to identify the group the detection belongs to. In other words, the tags associate each detection with other detections in the same group.

Consider the special case of detections in 2D and embeddings in 1D (real numbers). The network outputs both a heatmap of per-pixel detection scores and a heatmap of per-pixel identity tags. The detections and groups are then decoded from these two heatmaps.

To train a network to predict the tags, we use a loss function that encourages pairs of tags to have similar values if the corresponding detections belong to the same group in the ground truth or dissimilar values otherwise. It is important to note that we have no “ground truth” tags for the network to predict, because what matters is not the particular tag values, only the differences between them. The network has the freedom to decide on the tag values as long as they agree with the ground truth grouping.

We apply our approach to multiperson pose estimation, an important task for understanding humans in images. Concretely, given an input image, multi-person pose estimation seeks to detect each person and localize their body joints. Unlike single-person pose there are no prior assumptions of a person’s location or size. Multi-person pose systems must scan the whole image detecting all people and their corresponding keypoints. For this task, we integrate associative embedding with a stacked hourglass network [31], which produces a detection heatmap and a tagging heatmap for each body joint, and then groups body joints with similar tags into individual people. Experiments demonstrate that our approach outperforms all recent methods and achieves state of the art results on MS-COCO [27] and MPII Multiperson Pose.

We further demonstrate the utility of our method by applying it to instance segmentation, showing that it is straightforward to apply associative embedding to a variety of vision tasks that fit under the umbrella of detection and grouping.

Our contributions are twofold: (1) we introduce associative embedding, a new method for single-stage, end-to-end joint detection and grouping. This method is simple and generic; it works with any network architecture that produces pixel-wise predictions; (2) we apply associative embedding to multiperson pose estimation and achieve state of the art results on two standard benchmarks.

 

2. Related Work

Vector Embeddings Our method is related to many prior works that use vector embeddings. Works in image retrieval have used vector embeddings to measure similarity between images [17,53]. Works in image classification, image captioning, and phrase localization have used vector embeddings to connect visual features and text features by mapping them to the same vector space [16,20,30]. Works in natural language processing have used vector embeddings to represent the meaning of words, sentences, and paragraphs [39,32]. Our work differs from these prior works in that we use vector embeddings as identity tags in the context of joint detection and grouping.

Perceptual Organization Work in perceptual organization aims to group the pixels of an image into regions, parts, and objects. Perceptual organization encompasses a wide range of tasks of varying complexity from figure-ground segmentation [37] to hierarchical image parsing [21]. Prior works typically use a two stage pipeline [38], detecting basic visual units (patches, superpixels, parts, etc.) first and grouping them second. Common grouping approaches include spectral clustering [51,46], conditional random fields (e.g. [31]), and generative probabilistic models (e.g. [21]). These grouping approaches all assume predetected basic visual units and precomputed affinity measures between them but differ among themselves in the process of converting affinity measures into groups. In contrast, our approach performs detection and grouping in one stage using a generic network that includes no special design for grouping.

It is worth noting a close connection between our approach and those using spectral clustering. Spectral clustering techniques (e.g. normalized cuts [46]) take as input precomputed affinities between visual units (such as those predicted by a deep network) and solve a generalized eigenproblem to produce embeddings (one per visual unit) that are similar for visual units with high affinity. Angular Embedding [37, 47] extends spectral clustering by embedding depth ordering as well as grouping. Our approach differs from spectral clustering in that we have no intermediate representation of affinities nor do we solve any eigenproblems. Instead our network directly outputs the final embeddings.

Our approach is also related to the work by Harley et al. on learning dense convolutional embeddings [24], which trains a deep network to produce pixel-wise embeddings for the task of semantic segmentation. Our work differs from theirs in that our network produces not only pixel-wise embeddings but also pixel-wise detection scores. Our novelty lies in the integration of detection and grouping into a single network; to the best of our knowledge such an integration has not been attempted for multiperson human pose estimation.

Multiperson Pose Estimation

Recent methods have made great progress improving human pose estimation in images in particular for single person pose estimation [50, 48, 52,40, 8, 5, 41, 4, 14, 19, 34, 26, 7, 49, 44]. For multiperson pose, prior and concurrent work can be categorized as either top-down or bottom-up. Top-down approaches [42, 25, 15] first detect individual people and then estimate each person’s pose. Bottom-up approaches [45, 28, 29, 6] instead detect individual body joints and then group them into individuals. Our approach more closely resembles bottom-up approaches but differs in that there is no separation of a detection and grouping stage. The entire prediction is done at once by a single-stage, generic network. This does away with the need for complicated post-processing steps required by other methods [6, 28].

Instance Segmentation

Most existing instance segmentation approaches employ a multi-stage pipeline to do detection followed by segmentation [23,18,22,11]. Dai et al. [12] made such a pipeline differentiable through a special layer that allows backpropagation through spatial coordinates.

Two recent works have sought tighter integration of detection and segmentation using fully convolutional networks. DeepMask [43] densely scans subwindows and outputs a detection score and a segmentation mask (reshaped to a vector) for each subwindow. Instance-Sensitive FCN (IS-FCN) [10] treats each object as composed of a set of object parts in a regular grid, and outputs a per-pixel heatmap of detection scores for each object part. IS-FCN then detects object instances where the part detection scores are spatially coherent, and assembles object masks from the heatmaps of object parts. Compared to DeepMask and IS-FCN, our approach is substantially simpler: for each object category we output only two values at each pixel location, a score representing foreground versus background, and a tag representing the identity of an object instance, whereas both DeepMask and IS-FCN produce much higher dimensional output.

 

3. Approach

3.1 Overview

To introduce associative embedding for joint detection and grouping, we first review the basic formulation of visual detection. Many visual tasks involve detection of a set of visual units. These tasks are typically formulated as scoring of a large set of candidates. For example, single-person human pose estimation can be formulated as scoring candidate body joint detections at all possible pixel locations. Object detection can be formulated as scoring candidate bounding boxes at various pixel locations, scales, and aspect ratios.

The idea of associative embedding is to predict an embedding for each candidate in addition to the detection score. The embeddings serve as tags that encode grouping: detections with similar tags should be grouped together. In multiperson pose estimation, body joints with similar tags should be grouped to form a single person. It is important to note that the absolute values of the tags do not matter, only the distances between tags. That is, a network is free to assign arbitrary values to the tags as long as the values are the same for detections belonging to the same group.

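To make this concrete, here is a toy sketch in Python of grouping detections by tag distance; the detections, tag values, and the 0.5 threshold are all hypothetical, chosen only to illustrate that well-separated tags make grouping trivial.

```python
# Hypothetical detections: (joint, x, y, detection_score, tag).
# Absolute tag values are arbitrary; only their relative distances matter.
detections = [
    ("left_wrist",  40, 110, 0.93, 1.02),
    ("left_elbow",  55,  90, 0.88, 0.98),
    ("left_wrist", 200, 105, 0.91, 3.47),
    ("left_elbow", 215,  88, 0.85, 3.52),
]

def group_by_tag(dets, threshold=0.5):
    """Assign each detection to the group whose mean tag is closest."""
    groups = []
    for det in dets:
        tag = det[4]
        best, best_dist = None, threshold
        for group in groups:
            mean_tag = sum(d[4] for d in group) / len(group)
            dist = abs(tag - mean_tag)
            if dist < best_dist:
                best, best_dist = group, dist
        if best is None:
            groups.append([det])   # no close group: start a new person
        else:
            best.append(det)
    return groups

for i, person in enumerate(group_by_tag(detections)):
    print(f"person {i}:", [d[0] for d in person])
# person 0: ['left_wrist', 'left_elbow']   (tags near 1.0)
# person 1: ['left_wrist', 'left_elbow']   (tags near 3.5)
```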

Note that the dimension of the embeddings is not critical. If a network can successfully predict high-dimensional embeddings to separate the detections into groups, it should also be able to learn to project those high-dimensional embeddings to lower dimensions, as long as there is enough network capacity. In practice we have found that 1D embedding is sufficient for multiperson pose estimation, and higher dimensions do not lead to significant improvement. Thus throughout this paper we assume 1D embeddings.

To train a network to predict the tags, we enforce a loss that encourages similar tags for detections from the same group and different tags for detections across different groups. Specifically, this tagging loss is enforced on candidate detections that coincide with the ground truth. We compare pairs of detections and define a penalty based on the relative values of the tags and whether the detections should be from the same group.

3.2. Stacked Hourglass Architecture

In this work we combine associative embedding with the stacked hourglass architecture [40], a model for dense pixelwise prediction that consists of a sequence of modules each shaped like an hourglass (Fig. 2). Each “hourglass” has a standard set of convolutional and pooling layers that process features down to a low resolution capturing the full context of the image. Then, these features are upsampled and gradually combined with outputs from higher and higher resolutions until reaching the final output resolution. Stacking multiple hourglasses enables repeated bottom-up and top-down inference to produce a more accurate final prediction. We refer the reader to [40] for more details of the network architecture.

The stacked hourglass model was originally developed for single-person human pose estimation. The model outputs a heatmap for each body joint of a target person. Then, the pixel with the highest heatmap activation is used as the predicted location for that joint. The network is designed to consolidate global and local features which serves to capture information about the full structure of the body while preserving fine details for precise localization. This balance between global and local features is just as important in other pixel-wise prediction tasks, and we therefore apply the same network towards both multiperson pose estimation and instance segmentation.

We make some slight modifications to the network architecture. We increase the number of output features at each drop in resolution (256 -> 386 -> 512 -> 768). In addition, individual layers are composed of 3x3 convolutions instead of residual modules; the shortcut effect that eases training is still present thanks to the residual links across each hourglass as well as the skip connections at each resolution.

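The paper gives no code for these modifications; the following PyTorch sketch shows one plausible reading of the 3x3 convolution blocks and the stated channel schedule. Everything beyond the channel counts (batch norm, pooling placement) is our assumption, and PyTorch itself is used purely for illustration.

```python
import torch
from torch import nn

class ConvBlock(nn.Module):
    """Plain 3x3 conv + BN + ReLU, standing in for a residual module."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# Channels increase at each drop in resolution inside an hourglass:
channels = [256, 386, 512, 768]
down = nn.ModuleList(
    ConvBlock(c_in, c_out) for c_in, c_out in zip(channels[:-1], channels[1:])
)

x = torch.randn(1, 256, 128, 128)
for block in down:
    x = nn.functional.max_pool2d(block(x), 2)  # halve resolution after each block
print(x.shape)  # torch.Size([1, 768, 16, 16])
```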

 

 

Figure 3. An overview of our approach for producing multi-person pose estimates. For each joint of the body, the network simultaneously produces detection heatmaps and predicts associative embedding tags. We take the top detections for each joint and match them to other detections that share the same embedding tag to produce a final set of individual pose predictions.

 

3.3. Multiperson Pose Estimation

To apply associative embedding to multiperson pose estimation, we train the network to detect joints as performed in single-person pose estimation [40]. We use the stacked hourglass model to predict a detection score at each pixel location for each body joint (“left wrist”, “right shoulder”, etc.) regardless of person identity. The difference from single-person pose is that an ideal heatmap for multiple people should have multiple peaks (e.g. to identify multiple left wrists belonging to different people), as opposed to just a single peak for a single target person.

In addition to producing the full set of keypoint detections, the network automatically groups detections into individual poses. To do this, the network produces a tag at each pixel location for each joint. In other words, each joint heatmap has a corresponding “tag” heatmap. So, if there are m body joints to predict then the network will output a total of 2m channels, m for detection and m for grouping. To parse detections into individual people, we use non-maximum suppression to get the peak detections for each joint and retrieve their corresponding tags at the same pixel location (illustrated in Fig. 3). We then group detections across body parts by comparing the tag values of detections and matching up those that are close enough. A group of detections now forms the pose estimate for a single person.

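As an illustration of this parsing step, the sketch below extracts peak detections and their corresponding tags from hypothetical network outputs. The 3x3 max-pooling NMS, the top-k value, and the score threshold are our assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def top_detections(det_heatmaps, tag_heatmaps, k=30, threshold=0.1):
    """det_heatmaps, tag_heatmaps: (m, H, W), one channel per joint type.
    Returns, per joint, peak locations with their scores and tags."""
    m, H, W = det_heatmaps.shape
    # NMS: keep a pixel only if it is the maximum of its 3x3 neighborhood.
    pooled = F.max_pool2d(det_heatmaps.unsqueeze(0), 3, stride=1, padding=1)[0]
    peaks = det_heatmaps * (det_heatmaps == pooled).float()

    results = []
    for j in range(m):
        scores, idx = peaks[j].flatten().topk(k)
        keep = scores > threshold
        ys = torch.div(idx[keep], W, rounding_mode="floor")
        xs = idx[keep] % W
        tags = tag_heatmaps[j, ys, xs]  # read the tag at each peak location
        results.append(list(zip(xs.tolist(), ys.tolist(),
                                scores[keep].tolist(), tags.tolist())))
    return results

# Example with m = 16 joints on a 128x128 output:
detections = top_detections(torch.rand(16, 128, 128), torch.randn(16, 128, 128))
```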

To train the network, we impose a detection loss and a grouping loss on the output heatmaps. The detection loss computes mean square error between each predicted detection heatmap and its “ground truth” heatmap which consists of a 2D gaussian activation at each keypoint location. This loss is the same as the one used by Newell et al. [40].

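A minimal sketch of this supervision target and loss, assuming NumPy, a hypothetical sigma, and illustrative keypoint coordinates:

```python
import numpy as np

def gaussian_heatmap(height, width, keypoints, sigma=2.0):
    """Render a 2D gaussian at every annotated keypoint location."""
    ys, xs = np.mgrid[0:height, 0:width]
    hm = np.zeros((height, width), dtype=np.float32)
    for (kx, ky) in keypoints:
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)  # multiple people -> multiple peaks
    return hm

target = gaussian_heatmap(128, 128, [(30, 40), (90, 70)])
pred = np.random.rand(128, 128).astype(np.float32)
detection_loss = np.mean((pred - target) ** 2)  # MSE, as in Newell et al. [40]
```

Here the per-person gaussians are combined with a pixel-wise max, which is a common choice; the paper does not specify the exact composition.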

The grouping loss assesses how well the predicted tags agree with the ground truth grouping. Specifically, we retrieve the predicted tags for all body joints of all people at their ground truth locations; we then compare the tags within each person and across people. Tags within a person should be the same, while tags across people should be different.

 

{A special note here: in statistics and machine learning, "ground truth" refers to the correct labeling of the training data in supervised learning, used to confirm or refute a hypothesis. Supervised learning labels its training data; if the training labels are wrong, predictions on the test data will suffer, so the correctly labeled data is called the ground truth.}

Rather than enforce the loss across all possible pairs of keypoints, we produce a reference embedding for each person. This is done by taking the mean of the output embeddings of the person’s joints. Within an individual, we compute the squared distance between the reference embedding and the predicted embedding for each joint. Then, between pairs of people, we compare their reference embeddings to each other with a penalty that drops exponentially to zero as the distance between the two tags increases.

Formally, let $h_k \in \mathbb{R}^{W \times H}$ be the predicted tagging heatmap for the k-th body joint, where $h_k(x)$ is the tag value at pixel location $x$. Given $N$ people, let the ground truth body joint locations be $T = \{(x_{nk})\}$, $n = 1, \dots, N$, $k = 1, \dots, K$, where $x_{nk}$ is the ground truth pixel location of the k-th body joint of the n-th person.

Assuming all K joints are annotated, the reference embedding for the nth person would be

$$\bar{h}_n = \frac{1}{K} \sum_{k=1}^{K} h_k(x_{nk})$$

The grouping loss Lg is then defined as

$$L_g(h, T) = \frac{1}{N} \sum_{n} \sum_{k} \left( \bar{h}_n - h_k(x_{nk}) \right)^2 + \frac{1}{N^2} \sum_{n} \sum_{n' \neq n} \exp\left\{ -\frac{1}{2\sigma^2} \left( \bar{h}_n - \bar{h}_{n'} \right)^2 \right\}$$
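Below is a minimal PyTorch sketch of this loss; the data layout, $\sigma = 1$, and integer pixel coordinates are our assumptions.

```python
import torch

def grouping_loss(tag_heatmaps, joints_per_person, sigma=1.0):
    """tag_heatmaps: (K, H, W) predicted tag maps, one per joint type.
    joints_per_person: one list of (k, x, y) ground truth joints per person."""
    refs, pull = [], 0.0
    n = len(joints_per_person)
    for joints in joints_per_person:
        tags = torch.stack([tag_heatmaps[k, y, x] for (k, x, y) in joints])
        mean = tags.mean()
        pull = pull + ((tags - mean) ** 2).sum() / n  # pull joints toward reference
        refs.append(mean)
    refs = torch.stack(refs)
    diff = refs[:, None] - refs[None, :]              # pairwise reference distances
    push = torch.exp(-diff ** 2 / (2 * sigma ** 2))
    push = (push.sum() - n) / (n ** 2)                # drop the n diagonal terms
    return pull + push

# Usage on random data: two people, each with two annotated joints.
tags = torch.randn(16, 128, 128, requires_grad=True)
people = [[(0, 10, 12), (1, 14, 20)], [(0, 80, 60), (1, 84, 66)]]
loss = grouping_loss(tags, people)
loss.backward()
```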

To produce a final set of predictions we iterate through each joint one by one. An ordering is determined by first considering joints around the head and torso and gradually moving out to the limbs. We start with our first joint and take all activations above a certain threshold after non-maximum suppression. These form the basis for our initial pool of detected people.

We then consider the detections of a subsequent joint. We compare the tags from this joint to the tags of our current pool of people, and try to determine the best matching between them. Two tags can only be matched if they fall within a specific threshold. In addition, we want to prioritize matching of high confidence detections. We thus perform a maximum matching where the weighting is determined by both the tag distance and the detection score. If any new detection is not matched, it is used to start a new person instance. This accounts for cases where perhaps only a leg or hand is visible for a particular person.

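The sketch below mirrors this matching step using SciPy's Hungarian solver as a stand-in for the maximum matching described above; the cost definition (tag distance minus detection score) and the tag threshold are our assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_joint(people, detections, tag_threshold=1.0):
    """people: dicts with the joints and tag values collected so far.
    detections: (x, y, score, tag) tuples for the current joint type."""
    if not detections:
        return people
    if not people:
        return [{"joints": [d], "tags": [d[3]]} for d in detections]
    refs = np.array([np.mean(p["tags"]) for p in people])
    # Cost favors small tag distance and high detection confidence.
    cost = np.array([[abs(d[3] - r) - d[2] for r in refs] for d in detections])
    rows, cols = linear_sum_assignment(cost)
    matched = set()
    for r, c in zip(rows, cols):
        d = detections[r]
        if abs(d[3] - refs[c]) < tag_threshold:  # only match within the threshold
            people[c]["joints"].append(d)
            people[c]["tags"].append(d[3])
            matched.add(r)
    for r, d in enumerate(detections):
        if r not in matched:                     # unmatched detections start new people
            people.append({"joints": [d], "tags": [d[3]]})
    return people
```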

We loop through each joint of the body until every detection has been assigned to a person. No steps are taken to ensure anatomical correctness or reasonable spatial relationships between pairs of joints. To give an impression of the types of tags produced by the network and the trivial nature of grouping we refer to Figure 4.

 

Figure 4. Tags produced by our network on a held-out validation image from the MS-COCO training set. The tag values are already well separated and decoding the groups is straightforward.

While it is feasible to train a network to make pose predictions for people of all scales, there are some drawbacks. Extra capacity is required of the network to learn the necessary scale invariance, and the precision of predictions for small people will suffer due to issues of low resolution after pooling. To account for this, we evaluate images at test time at multiple scales. There are a number of potential ways to use the output from each scale to produce a final set of pose predictions. For our purposes, we take the produced heatmaps and average them together. Then, to combine tags across scales, we concatenate the set of tags at a pixel location into a vector $v \in \mathbb{R}^m$ (assuming m scales). The decoding process does not change from the method described with scalar tag values; we now just compare vector distances.

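A small sketch of this multi-scale combination, assuming the per-scale outputs have already been resized to a common resolution:

```python
import numpy as np

def combine_scales(det_maps, tag_maps):
    """det_maps, tag_maps: lists of (K, H, W) arrays, one per evaluated scale,
    already resized to a common output resolution (resizing omitted here)."""
    det = np.mean(det_maps, axis=0)        # average detection heatmaps
    tags = np.stack(tag_maps, axis=-1)     # (K, H, W, m): vector-valued tags
    return det, tags

def tag_distance(v1, v2):
    # Grouping is unchanged except that tag distance is now a vector norm.
    return np.linalg.norm(v1 - v2)
```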

Figure 5. To produce instance segmentations we decode the network output as follows: First we threshold on the detection heatmap, the resulting binary mask is used to get a set of tag values. By looking at the distribution of tags we can determine identifier tags for each instance and match the tag of each activated pixel to the closest identifier.

3.4. Instance Segmentation

The goal of instance segmentation is to detect and classify object instances while providing a segmentation mask for each object. As a proof of concept we show how to apply our approach to this problem, and demonstrate preliminary results. Like multi-person pose estimation, instance segmentation is a problem of joint detection and grouping. Pixels belonging to an object class are detected, and then those associated with a single object are grouped together. For simplicity the following description of our approach assumes only one object category.

Given an input image, we use a stacked hourglass network to produce two heatmaps, one for detection and one for tagging. The detection heatmap gives a detection score at each pixel indicating whether the pixel belongs to any instance of the object category, that is, the detection heatmap segments the foreground from background. At the same time, the tagging heatmap tags each pixel such that pixels belonging to the same object instance have similar tags.

To train the network, we supervise the detection heatmap by comparing the predicted heatmap with the ground truth heatmap (the union of all instance masks). The loss is the mean squared error between the two heatmaps. We supervise the tagging heatmap by imposing a loss that encourages the tags to be similar within an object instance and different across instances. The formulation of the loss is similar to that for multiperson pose. There is no need to do a comparison of every pixel in an instance segmentation mask. Instead we randomly sample a small set of pixels from each object instance and do pairwise comparisons across the group of sampled pixels.

Formally, let $h \in \mathbb{R}^{W \times H}$ be a predicted $W \times H$ tagging heatmap. Let $x$ denote a pixel location and $h(x)$ the tag at that location, and let $S_n = \{x_{nk}\}$, $k = 1, \dots, K$ be a set of locations randomly sampled within the n-th object instance. The grouping loss $L_g$ is defined as

$$L_g(h) = \frac{1}{N} \sum_{n} \frac{1}{K^2} \sum_{x \in S_n} \sum_{x' \in S_n} \left( h(x) - h(x') \right)^2 + \frac{1}{N^2} \sum_{n} \sum_{n' \neq n} \frac{1}{K^2} \sum_{x \in S_n} \sum_{x' \in S_{n'}} \exp\left\{ -\frac{1}{2\sigma^2} \left( h(x) - h(x') \right)^2 \right\}$$
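A PyTorch sketch of this sampled pairwise loss; the sample count and $\sigma$ are assumptions.

```python
import torch

def instance_grouping_loss(tag_map, instance_masks, samples=10, sigma=1.0):
    """tag_map: (H, W) predicted tag heatmap. instance_masks: list of (H, W)
    boolean masks, one per object instance."""
    per_instance = []
    for mask in instance_masks:
        ys, xs = torch.nonzero(mask, as_tuple=True)
        idx = torch.randint(len(ys), (samples,))      # sample a few pixels
        per_instance.append(tag_map[ys[idx], xs[idx]])

    pull, push, n = 0.0, 0.0, len(per_instance)
    for i in range(n):
        ti = per_instance[i]
        pull = pull + ((ti[:, None] - ti[None, :]) ** 2).mean() / n
        for j in range(n):
            if i != j:
                d = ti[:, None] - per_instance[j][None, :]
                push = push + torch.exp(-d ** 2 / (2 * sigma ** 2)).mean() / n ** 2
    return pull + push

# Usage: two square instances on a 64x64 map.
tag_map = torch.randn(64, 64, requires_grad=True)
masks = [torch.zeros(64, 64, dtype=torch.bool) for _ in range(2)]
masks[0][10:20, 10:20] = True
masks[1][40:50, 40:50] = True
instance_grouping_loss(tag_map, masks).backward()
```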

To decode the output of the network, we first threshold on the detection channel heatmap to produce a binary mask. Then, we look at the distribution of tags within this mask. We calculate a histogram of the tags and perform non-maximum suppression to determine a set of values to use as identifiers for each object instance. Each pixel from the detection mask is then assigned to the object with the closest tag value. See Figure 5 for an illustration of this process.

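The sketch below follows the decoding steps of Figure 5 on hypothetical single-category outputs; the thresholds, bin count, and the simple histogram NMS are our assumptions.

```python
import numpy as np

def decode_instances(det_map, tag_map, det_threshold=0.5, min_fraction=0.1):
    """det_map, tag_map: (H, W) arrays for a single object category."""
    mask = det_map > det_threshold                   # 1. threshold the detection map
    if not mask.any():
        return np.zeros(det_map.shape, dtype=int)    # no foreground at all
    hist, edges = np.histogram(tag_map[mask], bins=50)
    centers = (edges[:-1] + edges[1:]) / 2           # 2. histogram of foreground tags
    # 3. simple NMS over the histogram: keep local maxima with enough support
    peaks = [centers[i] for i in range(len(hist))
             if hist[i] >= min_fraction * hist.max()
             and hist[i] == hist[max(0, i - 2):i + 3].max()]
    # 4. assign each foreground pixel to the instance with the closest identifier
    ids = np.abs(tag_map[..., None] - np.array(peaks)).argmin(-1)
    return np.where(mask, ids + 1, 0)                # 0 is background

labels = decode_instances(np.random.rand(64, 64), np.random.randn(64, 64))
```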

Note that it is straightforward to generalize from one object category to multiple: we simply output a detection heatmap and a tagging heatmap for each object category. As with multi-person pose, the issue of scale invariance is worth consideration. Rather than train a network to recognize the appearance of an object instance at every possible scale, we evaluate at multiple scales and combine predictions in a similar manner to that done for pose estimation.


Figure 6. Qualitative pose estimation results on MSCOCO validation images

 

4. Experiments

4.1 Multiperson Pose Estimation

Dataset We evaluate on two datasets: MS-COCO [35] and MPII Human Pose [3]. MPII Human Pose consists of about 25k images and contains around 40k total annotated people (three-quarters of which are available for training). Evaluation is performed on MPII Multi-Person, a set of 1758 groups of multiple people taken from the test set as outlined in [45]. The groups for MPII Multi-Person are usually a subset of the total people in a particular image, so some information is provided to make sure predictions are made on the correct targets. This includes a general bounding box and scale term used to indicate the occupied region. No information is provided on the number of people or the scales of individual figures. We use the evaluation metric outlined by Pishchulin et al. [45], calculating average precision of joint detections.

 

Figure 7. Here we visualize the associative embedding channels for different joints. The change in embedding predictions across joints is particularly apparent in these examples where there is significant overlap of the two target figures.

MS-COCO [35] consists of around 60K training images with more than 100K people with annotated keypoints. We report performance on two test sets, a development test set (test-dev) and a standard test set (test-std). We use the official evaluation metric that reports average precision (AP) and average recall (AR) in a manner similar to object detection, except that a score based on keypoint distance is used instead of bounding box overlap. We refer the reader to the MS-COCO website for details [1].

Implementation The network used for this task consists of four stacked hourglass modules, with an input size of 512×512 and an output resolution of 128×128. We train the network using a batch size of 32 with a learning rate of 2e-4 (dropped to 1e-5 after 100k iterations) using Tensorflow [2]. The associative embedding loss is weighted by a factor of 1e-3 relative to the MSE loss of the detection heatmaps. The loss is masked to ignore crowds with sparse annotations. At test time an input image is run at multiple scales; the output detection heatmaps are averaged across scales, and the tags across scales are concatenated into higher dimensional tags. Since the metrics of MPII and MS-COCO are both sensitive to the precise localization of keypoints, following prior work [6], we apply a single-person pose model [40] trained on the same dataset to further refine predictions.

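As a sketch of how the two loss terms combine under this weighting (the 1e-3 factor is from the text; names and shapes are illustrative):

```python
import torch

def combined_loss(pred_det, gt_det, embedding_loss, weight=1e-3):
    # MSE on the detection heatmaps plus the associative embedding loss,
    # weighted by 1e-3 relative to the MSE term, as stated above.
    return torch.mean((pred_det - gt_det) ** 2) + weight * embedding_loss
```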

MPII Results Average precision results can be seen in Table 1, demonstrating an improvement over state-of-the-art methods in overall AP. Associative embedding proves to be an effective method for teaching the network to group keypoint detections into individual people. It requires no assumptions about the number of people present in the image, and also offers a mechanism for the network to express confusion of joint assignments. For example, if the same joint of two people overlaps at the exact same pixel location, the predicted associative embedding will be a tag somewhere between the respective tags of each person.

We can get a better sense of the associative embedding output with visualizations of the embedding heatmap (Figure 7). We put particular focus on the difference in the predicted embeddings when people overlap heavily as the severe occlusion and close spacing of detected joints make it much more difficult to parse out the poses of individual people.

 

MS-COCO Results Table 2 and Table 3 report our results on MS-COCO. We report results on both test-std and test-dev because not all recent methods report on test-std. We see that on both sets we achieve state-of-the-art performance. An illustration of the network’s predictions can be seen in Figure 6. Typical failure cases of the network stem from overlapping and occluded joints in cluttered scenes. Table 4 reports performance of ablated versions of our full pipeline, showing the contributions from applying our model at multiple scales and from further refinement using a single-person pose estimator. We see that simply applying our network at multiple scales already achieves competitive performance against prior state of the art methods, demonstrating the effectiveness of our end-to-end joint detection and grouping.

We also perform an additional experiment on MS-COCO to gauge the relative difficulty of detection versus grouping, that is, which part is the main bottleneck of our system. We evaluate our system on a held-out set of 500 training images. In this evaluation, we replace the predicted detections with the ground truth detections but still use the predicted tags. Using the ground truth detections improves AP from 59.2 to 94.0. This shows that keypoint detection is the main bottleneck of our system, whereas the network has learned to produce high quality grouping. This fact is also supported by qualitative inspection of the predicted tag values, as shown in Figure 4, from which we can see that the tags are well separated and decoding the grouping is straightforward.

 

4.2. Instance Segmentation

Dataset For evaluation we use the val split of PASCAL VOC 2012 [13], consisting of 1,449 images. Additional pretraining is done with images from MS COCO [35]. Evaluation is done using mean average precision of instance segments at different IOU thresholds [22, 10, 36].

Implementation The network is trained in Torch [9] with an input resolution of 256 × 256 and output resolution of 64 × 64. The weighting of the associative embedding loss is lowered to 1e-4. During training, to account for scale, only objects that appear within a certain size range are supervised, and a loss mask is used to ignore objects that are too big or too small. In PASCAL VOC, ignore regions are also defined at object boundaries, and we include these in the loss mask. Training is done from scratch on MS COCO for three days, and then fine tuned on PASCAL VOC train for 12 hours. At test time the image is evaluated at 3 scales (x0.5, x1.0, and x1.5). Rather than average heatmaps, we generate instance proposals at each scale and do non-maximum suppression to remove overlapping proposals across scales. A more sophisticated approach for multi-scale evaluation is worth further exploration.

Results We show mAP results on the val set of PASCAL VOC 2012 in Table 4.2, along with some qualitative examples in Figure 8. We offer these results as a proof of concept that associative embeddings can be used in this manner. We achieve reasonable instance segmentation predictions using the same supervision as we use for multi-person pose. Tuning of training and postprocessing will likely improve performance, but the main takeaway is that associative embedding serves well as a general technique for disparate computer vision tasks that fall under the umbrella of detection and grouping problems.

Figure 8. Example instance predictions produced by our system on the PASCAL VOC 2012 validation set.

5. Conclusion

In this work we introduce associative embeddings to supervise a convolutional neural network such that it can simultaneously generate and group detections. We apply this method to two vision problems: multi-person pose and instance segmentation. We demonstrate the feasibility of training for both tasks, and for pose we achieve state-of-the-art performance. Our method is general enough to be applied to other vision problems as well, for example multi-object tracking in video. The associative embedding loss can be implemented given any network that produces pixelwise predictions, so it can be easily integrated with other state-of-the-art architectures.

References

See the original paper.

arXiv:1611.05424
