[Paper Close Reading] SORT: A Real-Time Object Tracking Algorithm

Title: SIMPLE ONLINE AND REALTIME TRACKING

Abstract

This paper explores a pragmatic approach to multiple object tracking where the main focus is to associate objects efficiently for online and realtime applications. To this end, detection quality is identified as a key factor influencing tracking performance, where changing the detector can improve tracking by up to 18.9%. Despite only using a rudimentary combination of familiar techniques such as the Kalman Filter and Hungarian algorithm for the tracking components, this approach achieves an accuracy comparable to state-of-the-art online trackers. Furthermore, due to the simplicity of our tracking method, the tracker updates at a rate of 260 Hz, which is over 20x faster than other state-of-the-art trackers.

Index Terms— Computer Vision, Multiple Object Tracking, Detection, Data Association

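This paper explores a pragmatic approach to multiple object tracking whose main goal is to associate objects efficiently for online, real-time applications. Detection quality is identified as a key factor in tracking performance: changing the detector can improve tracking by up to 18.9%. Although the tracking component is only a rudimentary combination of the familiar Kalman filter and Hungarian algorithm, the method reaches an accuracy comparable to state-of-the-art online trackers. Because the method is so simple, the tracker updates at 260 Hz, more than 20x faster than other state-of-the-art trackers.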

Keywords: computer vision, multiple object tracking, detection, data association

Introduction

This paper presents a lean implementation of a tracking-by-detection framework for the problem of multiple object tracking (MOT) where objects are detected each frame and represented as bounding boxes. In contrast to many batch based tracking approaches [1, 2, 3], this work is primarily targeted towards online tracking where only detections from the previous and the current frame are presented to the tracker. Additionally, a strong emphasis is placed on efficiency for facilitating realtime tracking and to promote greater uptake in applications such as pedestrian tracking for autonomous vehicles.

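This paper presents a lean implementation of a tracking-by-detection framework for multiple object tracking, in which objects are detected in each frame and represented as bounding boxes. Unlike many batch-based tracking methods, this work targets online tracking, where only the detections from the previous and current frames are presented to the tracker. Strong emphasis is also placed on efficiency, to facilitate real-time tracking and promote wider uptake in applications such as pedestrian tracking for autonomous vehicles.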

The MOT problem can be viewed as a data association problem where the aim is to associate detections across frames in a video sequence. To aid the data association process, trackers use various methods for modelling the motion [1, 4] and appearance [5, 3] of objects in the scene. The methods employed by this paper were motivated through observations made on a recently established visual MOT benchmark [6]. Firstly, there is a resurgence of mature data association techniques including Multiple Hypothesis Tracking (MHT) [7, 3] and Joint Probabilistic Data Association (JPDA) [2] which occupy many of the top positions of the MOT benchmark. Secondly, the only tracker that does not use the Aggregate Channel Filter (ACF) [8] detector is also the top ranked tracker, suggesting that detection quality could be holding back the other trackers. Furthermore, the trade-off between accuracy and speed appears quite pronounced, since the speed of most accurate trackers is considered too slow for realtime applications (see Fig. 1). With the prominence of traditional data association techniques among the top online and batch trackers along with the use of different detections used by the top tracker, this work explores how simple MOT can be and how well it can perform.

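The MOT problem can be viewed as a data association problem: the aim is to associate detections across the frames of a video sequence. To aid data association, trackers use various methods to model the motion and appearance of objects in the scene. The method in this paper was motivated by observations on a recently established visual MOT benchmark. First, mature data association techniques such as Multiple Hypothesis Tracking (MHT) and Joint Probabilistic Data Association (JPDA) occupy many of the top positions on the benchmark. Second, the only tracker that does not use the Aggregate Channel Filter (ACF) detector is also the top-ranked tracker, which suggests that detection quality may be holding back the other trackers. Third, the trade-off between accuracy and speed is pronounced, since most of the accurate trackers are too slow for real-time use (see Fig. 1). This paper therefore explores how simple MOT can be and how well it can perform.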

Keeping in line with Occam’s Razor, appearance features beyond the detection component are ignored in tracking and only the bounding box position and size are used for both motion estimation and data association. Furthermore, issues regarding short-term and long-term occlusion are also ignored, as they occur very rarely and their explicit treatment introduces undesirable complexity into the tracking framework. We argue that incorporating complexity in the form of object re-identification adds significant overhead into the tracking framework, potentially limiting its use in realtime applications.

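In keeping with Occam's razor (reader's note: Occam's razor is the principle that entities should not be multiplied beyond necessity), appearance features beyond the detection component are ignored in tracking; only the bounding box position and size are used for motion estimation and data association. Short-term and long-term occlusion are also ignored, because they occur very rarely and handling them explicitly would add undesirable complexity to the framework. We argue that adding complexity in the form of object re-identification introduces significant overhead, potentially limiting use in real-time applications.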

This design philosophy is in contrast to many proposed visual trackers that incorporate a myriad of components to handle various edge cases and detection errors [9, 10, 11, 12]. This work instead focuses on efficient and reliable handling of the common frame-to-frame associations. Rather than aiming to be robust to detection errors, we instead exploit recent advances in visual object detection to solve the detection problem directly. This is demonstrated by comparing the common ACF pedestrian detector [8] with a recent convolutional neural network (CNN) based detector [13]. Additionally, two classical yet extremely efficient methods, Kalman filter [14] and Hungarian method [15], are employed to handle the motion prediction and data association components of the tracking problem respectively. This minimalistic formulation of tracking facilitates both efficiency and reliability for online tracking, see Fig. 1. In this paper, this approach is only applied to tracking pedestrians in various environments, however due to the flexibility of CNN based detectors [13], it naturally can be generalized to other object classes.


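This design philosophy contrasts with many visual trackers that incorporate a large number of components to handle various edge cases and detection errors. This work instead focuses on efficient and reliable handling of the common frame-to-frame associations, and relies on recent advances in object detection to solve the detection problem directly rather than trying to be robust to detection errors. Two classical yet extremely efficient methods, the Kalman filter and the Hungarian algorithm, handle motion prediction and data association respectively. The method is currently applied only to tracking pedestrians in various environments but, given the flexibility of CNN-based detectors, it generalises naturally to other object classes.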

The main contributions of this paper are:
• We leverage the power of CNN based detection in the context of MOT.
• A pragmatic tracking approach based on the Kalman filter and the Hungarian algorithm is presented and evaluated on a recent MOT benchmark.
• Code will be open sourced to help establish a baseline method for research experimentation and uptake in collision avoidance applications.

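Main contributions of this paper:

  • CNN-based detection is leveraged in the context of MOT;
  • a pragmatic tracking approach based on the Kalman filter and the Hungarian algorithm is presented;
  • the code will be open-sourced to help establish a baseline method.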

Literature Review

Traditionally MOT has been solved using Multiple Hypothesis Tracking (MHT) [7] or the Joint Probabilistic Data Association (JPDA) filters [16, 2], which delay making difficult decisions while there is high uncertainty over the object assignments. The combinatorial complexity of these approaches is exponential in the number of tracked objects making them impractical for realtime applications in highly dynamic environments.

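Traditionally, multi-object tracking has been solved with MHT or JPDA, which delay difficult decisions while there is high uncertainty over the object assignments. The combinatorial complexity of these methods grows exponentially with the number of tracked objects, which makes them impractical for real-time applications in highly dynamic environments.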

Many online tracking methods aim to build appearance models of either the individual objects themselves [17, 18, 12] or a global model [19, 11, 4, 5] through online learning. In addition to appearance models, motion is often incorporated to assist associating detections to tracklets [1, 19, 4, 11]. When considering only one-to-one correspondences modelled as bipartite graph matching, globally optimal solutions such as the Hungarian algorithm [15] can be used [10, 20].

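Many online tracking methods aim to build appearance models through online learning, either of the individual objects or as a global model. Besides appearance models, motion is often used to help associate detections with tracklets. When only one-to-one correspondences are considered and modelled as bipartite graph matching, globally optimal solutions such as the Hungarian algorithm can be used.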

The method by Geiger et al. [20] uses the Hungarian algorithm [15] in a two stage process. First, tracklets are formed by associating detections across adjacent frames where both geometry and appearance cues are combined to form the affinity matrix. Then, the tracklets are associated to each other to bridge broken trajectories caused by occlusion, again using both geometry and appearance cues. This two step association method restricts this approach to batch computation. Our approach is inspired by the tracking component of [20], however we simplify the association to a single stage with basic cues as described in the next section.

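The method of Geiger et al. [20] uses the Hungarian algorithm [15] in a two-stage process. First, tracklets are formed by associating detections across adjacent frames, with geometry and appearance cues combined to form the affinity matrix. Second, the tracklets are associated with each other to bridge trajectories broken by occlusion, again using both geometry and appearance cues. This two-stage association restricts the method to batch computation. Our method is inspired by the tracking component of [20], but simplifies the association to a single stage.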

Method

The proposed method is described by the key components of detection, propagating object states into future frames, associating current detections with existing objects, and managing the lifespan of tracked objects.

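The method is described in terms of its key components: detection, propagating object states into future frames, associating current detections with existing objects, and managing the lifespan of tracked objects.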

3.1 Detection

To capitalise on the rapid advancement of CNN based detection, we utilise the Faster Region CNN (FrRCNN) detection framework [13]. FrRCNN is an end-to-end framework that consists of two stages. The first stage extracts features and proposes regions for the second stage which then classifies the object in the proposed region. The advantage of this framework is that parameters are shared between the two stages creating an efficient framework for detection. Additionally, the network architecture itself can be swapped to any design which enables rapid experimentation of different architectures to improve the detection performance.

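To take advantage of the rapid progress in CNN-based detection, this paper uses the Faster RCNN (FrRCNN) detector.
Faster RCNN is an end-to-end, two-stage detection framework: the first stage extracts features and proposes regions, and the second stage classifies the object in each proposed region. The advantage is that the two stages share parameters, which makes detection efficient. The network architecture itself can also be swapped, so different architectures can be tried quickly to improve detection performance.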

Here we compare two network architectures provided with FrRCNN, namely the architecture of Zeiler and Fergus (FrRCNN(ZF)) [21] and the deeper architecture of Simonyan and Zisserman (FrRCNN(VGG16)) [22]. Throughout this work, we apply the FrRCNN with default parameters learnt for the PASCAL VOC challenge. As we are only interested in pedestrians we ignore all other classes and only pass person detection results with output probabilities greater than 50% to the tracking framework.

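Here we compare two network architectures, FrRCNN(ZF) and the deeper FrRCNN(VGG16). FrRCNN is used with the default parameters learnt for PASCAL VOC. Since only pedestrians are of interest, all other classes are ignored, and only person detections with an output probability above 50% are passed to the tracking framework.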
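The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Detection` container, its field names, and the sample values are assumptions made up for the example; only the "person class, probability greater than 50%" rule comes from the text.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    # Hypothetical container; field names are illustrative only.
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def filter_person_detections(
    detections: List[Detection], min_score: float = 0.5
) -> List[Detection]:
    """Keep only 'person' detections whose probability exceeds min_score."""
    return [d for d in detections if d.label == "person" and d.score > min_score]


# Made-up detector output for one frame.
raw = [
    Detection("person", 0.92, (10, 20, 50, 120)),
    Detection("person", 0.31, (200, 40, 230, 110)),  # below the threshold
    Detection("car", 0.88, (300, 60, 420, 140)),     # wrong class
]
kept = filter_person_detections(raw)
```

Only the first detection survives; everything downstream of this point sees bounding boxes and nothing else.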

In our experiments, we found that the detection quality has a significant impact on tracking performance when comparing the FrRCNN detections to ACF detections. This is demonstrated using a validation set of sequences applied to both an existing online tracker MDP [12] and the tracker proposed here. Table 1 shows that the best detector (FrRCNN(VGG16)) leads to the best tracking accuracy for both MDP and the proposed method.

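The experiments show that detection quality has a significant impact on tracking performance when FrRCNN detections are compared with ACF detections. This holds on a validation set of sequences for both the existing online tracker MDP and the tracker proposed here. Table 1 shows that the best detector, FrRCNN(VGG16), gives the best tracking accuracy for both MDP and the proposed method.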

3.2. Estimation Model

Here we describe the object model, i.e. the representation and the motion model used to propagate a target’s identity into the next frame. We approximate the inter-frame displacements of each object with a linear constant velocity model which is independent of other objects and camera motion. The state of each target is modelled as:

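Here we describe the object model, that is, the representation and motion model used to propagate a target's identity into the next frame. Each object's inter-frame displacement is approximated with a linear constant-velocity model that is independent of other objects and of camera motion. The state of each target is modelled as: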

$$\mathbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^{T},$$

where u and v represent the horizontal and vertical pixel location of the centre of the target, while the scale s and r represent the scale (area) and the aspect ratio of the target’s bounding box respectively. Note that the aspect ratio is considered to be constant. When a detection is associated to a target, the detected bounding box is used to update the target state where the velocity components are solved optimally via a Kalman filter framework [14]. If no detection is associated to the target, its state is simply predicted without correction using the linear velocity model.

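Here u and v are the horizontal and vertical pixel coordinates of the target centre, and s and r are the scale (area) and aspect ratio of the target's bounding box (note: the aspect ratio r is treated as constant).
When a detection is associated with a target, the detected bounding box is used to update the target state, with the velocity components solved by the Kalman filter framework. If no detection is associated with the target, its state is simply predicted with the linear velocity model, without correction.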
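As a rough sketch of the state representation above, the snippet below converts a corner-format box to the observed part of the state, [u, v, s, r], and applies one constant-velocity prediction step. It covers only the "predict without correction" case; the Kalman correction (covariances and gain) is deliberately left out. The function names are hypothetical, and taking the aspect ratio as width divided by height is an assumption, since the text does not say which way round it is defined.

```python
import math


def box_to_z(box):
    """Convert (x1, y1, x2, y2) to the observation [u, v, s, r]."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # Centre, area, and aspect ratio (assumed here to be width / height).
    return [x1 + w / 2.0, y1 + h / 2.0, w * h, w / h]


def z_to_box(u, v, s, r):
    """Convert centre, area and aspect ratio back to (x1, y1, x2, y2)."""
    w = math.sqrt(s * r)
    h = s / w
    return (u - w / 2.0, v - h / 2.0, u + w / 2.0, v + h / 2.0)


def predict(state):
    """One constant-velocity step on x = [u, v, s, r, du, dv, ds].

    The aspect ratio r has no velocity term, so it stays constant.
    """
    u, v, s, r, du, dv, ds = state
    return [u + du, v + dv, s + ds, r, du, dv, ds]


# Made-up example: a 40x80 box moving 2 px right and 1 px up per frame.
z = box_to_z((10.0, 20.0, 50.0, 100.0))
state = z + [2.0, -1.0, 0.0]
next_state = predict(state)
next_box = z_to_box(*next_state[:4])
```

Because the area velocity is zero in this example, the predicted box keeps its size and is simply shifted by the velocity.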

3.3. Data Association

In assigning detections to existing targets, each target’s bounding box geometry is estimated by predicting its new location in the current frame. The assignment cost matrix is then computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets. The assignment is solved optimally using the Hungarian algorithm. Additionally, a minimum IOU is imposed to reject assignments where the detection to target overlap is less than IOU_min.

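When assigning detections to existing targets, each target's bounding box geometry is estimated by predicting its new position in the current frame. The assignment cost matrix is computed as the IOU distance between each detection and every predicted bounding box of the existing targets, and the assignment is solved optimally with the Hungarian algorithm. In addition, an assignment is rejected when the overlap between the detection and the target is below the threshold IOU_min.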
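The association step can be illustrated with a small, dependency-free sketch. Note the substitution: instead of the Hungarian algorithm, the code below searches all assignments exhaustively, which gives the same optimal result but in factorial time, so it is only suitable for a handful of boxes. The function names, the default `iou_min=0.3`, and the sample boxes are assumptions for illustration; the text does not state the threshold value.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(detections, predictions, iou_min=0.3):
    """Return (matches, unmatched_detections, unmatched_tracks).

    Exhaustive stand-in for the Hungarian algorithm: maximising total IOU
    over one-to-one assignments is equivalent to minimising IOU distance.
    """
    n_det, n_trk = len(detections), len(predictions)
    scores = [[iou(d, p) for p in predictions] for d in detections]
    if n_det <= n_trk:
        candidates = (list(zip(range(n_det), p))
                      for p in permutations(range(n_trk), n_det))
    else:
        candidates = (list(zip(p, range(n_trk)))
                      for p in permutations(range(n_det), n_trk))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(scores[d][t] for d, t in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Reject assignments whose overlap is below the minimum IOU.
    matches = [(d, t) for d, t in best_pairs if scores[d][t] >= iou_min]
    matched_d = {d for d, _ in matches}
    matched_t = {t for _, t in matches}
    unmatched_d = [d for d in range(n_det) if d not in matched_d]
    unmatched_t = [t for t in range(n_trk) if t not in matched_t]
    return matches, unmatched_d, unmatched_t


# Made-up example: two predicted track boxes, three detections.
predictions = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 0, 11, 10), (100, 100, 110, 110)]
matches, unmatched_d, unmatched_t = associate(detections, predictions)
```

In practice a polynomial-time solver would replace the exhaustive loop; the inputs and outputs of the step stay the same. The third detection overlaps no track, so it comes back unmatched and would seed a new track.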

We found that the IOU distance of the bounding boxes implicitly handles short term occlusion caused by passing targets. Specifically, when a target is covered by an occluding object, only the occluder is detected, since the IOU distance appropriately favours detections with similar scale. This allows both the occluder target to be corrected with the detection while the covered target is unaffected as no assignment is made.

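We found that the IOU distance between bounding boxes implicitly handles short-term occlusion caused by passing targets. Specifically, when a target is covered by an occluding object, only the occluder is detected, because the IOU distance favours detections of similar scale. The occluder's track is therefore corrected by the detection, while the covered target is unaffected because no detection is assigned to it.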

3.4 Creation and Deletion of Track Identities

When objects enter and leave the image, unique identities need to be created or destroyed accordingly. For creating trackers, we consider any detection with an overlap less than IOU_min to signify the existence of an untracked object. The tracker is initialised using the geometry of the bounding box with the velocity set to zero. Since the velocity is unobserved at this point the covariance of the velocity component is initialised with large values, reflecting this uncertainty. Additionally, the new tracker then undergoes a probationary period where the target needs to be associated with detections to accumulate enough evidence in order to prevent tracking of false positives.

Tracks are terminated if they are not detected for T_Lost frames. This prevents an unbounded growth in the number of trackers and localisation errors caused by predictions over long durations without corrections from the detector. In all experiments T_Lost is set to 1 for two reasons. Firstly, the constant velocity model is a poor predictor of the true dynamics and secondly we are primarily concerned with frame-to-frame tracking where object re-identification is beyond the scope of this work. Additionally, early deletion of lost targets aids efficiency. Should an object reappear, tracking will implicitly resume under a new identity.

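When objects enter or leave the image, unique identities need to be created or destroyed. For creation, any detection whose overlap is below IOU_min is taken to indicate an untracked object. The tracker is initialised with the geometry of the bounding box and zero velocity. Because the velocity has not yet been observed, the covariance of the velocity component is initialised with large values to reflect this uncertainty. The new tracker also goes through a probationary period in which the target must be associated with detections to accumulate enough evidence, which prevents tracking of false positives.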

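A track is terminated when it has not been detected for T_Lost frames. This prevents unbounded growth in the number of trackers, and the growth of localisation errors from long stretches of prediction without correction from the detector. T_Lost is set to 1 in all experiments, for two reasons: first, the constant velocity model is a poor predictor of the true dynamics; second, the focus is on frame-to-frame tracking, and object re-identification is outside the scope of this work. Deleting lost targets early also helps efficiency. If an object reappears, tracking implicitly resumes under a new identity.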
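The creation and deletion rules above can be sketched as a per-frame bookkeeping step. Everything here is illustrative: the `Track` class, the `step` function, and the `min_hits=3` probation length are hypothetical, since the text does not say how long the probationary period is. The sketch also reads "not detected for T_Lost frames" literally, so with `t_lost=1` a track is dropped after a single missed frame; other readings are possible.

```python
import itertools


class Track:
    """Hypothetical bookkeeping for one tracked object."""
    _ids = itertools.count()

    def __init__(self, box):
        self.id = next(Track._ids)       # unique identity
        self.box = box
        self.hits = 1                    # detections associated so far
        self.frames_since_update = 0     # consecutive frames without a match


def step(tracks, matches, unmatched_detections, detections,
         t_lost=1, min_hits=3):
    """Apply one frame of association results to the track list.

    matches: list of (detection_index, track_index) pairs.
    Returns (all surviving tracks, tracks that have passed probation).
    """
    matched = set()
    for d, t in matches:
        tracks[t].box = detections[d]
        tracks[t].hits += 1
        tracks[t].frames_since_update = 0
        matched.add(t)
    for t, trk in enumerate(tracks):
        if t not in matched:
            trk.frames_since_update += 1
    # Terminate tracks that have gone t_lost frames without a detection.
    survivors = [trk for trk in tracks if trk.frames_since_update < t_lost]
    # Every unmatched detection starts a new track on probation.
    survivors += [Track(detections[d]) for d in unmatched_detections]
    confirmed = [trk for trk in survivors if trk.hits >= min_hits]
    return survivors, confirmed


# Made-up frame: track 0 is matched, track 1 is missed, detection 1 is new.
tracks = [Track((0, 0, 10, 10)), Track((20, 20, 30, 30))]
detections = [(1, 0, 11, 10), (100, 100, 110, 110)]
tracks, confirmed = step(tracks, [(0, 0)], [1], detections)
```

After this frame the missed track is gone, the new detection has its own identity, and nothing is reported yet because no track has accumulated enough hits.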

4 Experiments

Evaluation is performed on the MOT benchmark dataset.

4.1 Metrics

• MOTA(↑): Multi-object tracking accuracy [25].

• MOTP(↑): Multi-object tracking precision [25].

• FAF(↓): number of false alarms per frame.

• MT(↑): number of mostly tracked trajectories. I.e. target has the same label for at least 80% of its life span.

• ML(↓): number of mostly lost trajectories. i.e. target is not tracked for at least 20% of its life span.

• FP(↓): number of false detections.

• FN(↓): number of missed detections.

• ID sw(↓): number of times an ID switches to a different previously tracked object [24].

• Frag(↓): number of fragmentations where a track is interrupted by miss detection

↑ means higher is better; ↓ means lower is better.
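For orientation, MOTA is usually computed under the CLEAR MOT definition as one minus the ratio of all errors (misses, false positives, identity switches) to the total number of ground-truth objects. The list above only cites the metric, so the helper below is a sketch of that standard definition rather than something taken from this post, and the counts in the example are made up.

```python
def mota(fn: int, fp: int, id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + ID switches) / total ground-truth objects.

    All counts are summed over every frame of the sequence.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + id_switches) / num_gt


# Made-up counts, not results from the paper.
score = mota(fn=300, fp=150, id_switches=50, num_gt=1000)
```

Since the error count can exceed the number of ground-truth objects, MOTA can be negative.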

4.2 Performance Evaluation

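Tracking performance is evaluated on the MOT benchmark [6] test server, where the ground truth for 11 sequences is withheld. Table 2 compares the proposed method, SORT, with several baseline trackers. For brevity, only the most relevant trackers are listed: the state-of-the-art online trackers in terms of accuracy (TDAM [18], MDP [12]), the fastest batch-based tracker (DP NMS [23]), and the all-round near-online method (NOMT [11]). The methods that inspired this approach (TBD [20], ALExTRAC [5] and SMOT [1]) are also listed. Compared with these, SORT achieves the highest MOTA score among the online trackers and is comparable to the state-of-the-art NOMT, which is significantly more complex and uses frames from the near future. In addition, because SORT concentrates on frame-to-frame association to grow tracklets, its number of lost targets (ML) is the lowest of the compared methods, despite a similar number of false negatives.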

4.3 Runtime

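Most MOT solutions aim to push performance towards higher accuracy, often at the cost of runtime performance. Slow runtime may be tolerable for offline processing, but real-time performance is essential for robotics and autonomous vehicles. Fig. 1 shows the speed and accuracy of a number of trackers on the MOT benchmark [6]. The methods that achieve the best accuracy also tend to be the slowest (bottom right of Fig. 1). At the other end of the spectrum, the fastest methods tend to have lower accuracy (top left of Fig. 1). SORT combines the two desirable properties, speed and accuracy, without the typical drawbacks (top right of Fig. 1). The tracking component runs at 260 Hz on a single core of an Intel i7 2.5 GHz machine with 16 GB of memory.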

References

Paper: https://arxiv.org/abs/1602.00763
GitHub: https://github.com/abewley/sort

https://blog.csdn.net/c20081052/article/details/86500352
https://blog.csdn.net/zjc910997316/article/details/83962462
