Moving Obstacle Detection in Highly Dynamic Scenes (Paper Translation)

Moving Obstacle Detection in Highly Dynamic Scenes

A. Ess, B. Leibe, K. Schindler, L. van Gool

Abstract— We address the problem of vision-based multi-person tracking in busy pedestrian zones using a stereo rig mounted on a mobile platform. Specifically, we are interested in the application of such a system for supporting path planning algorithms in the avoidance of dynamic obstacles. The complexity of the problem calls for an integrated solution, which extracts as much visual information as possible and combines it through cognitive feedback. We propose such an approach, which jointly estimates camera position, stereo depth, object detections, and trajectories based only on visual information. The interplay between these components is represented in a graphical model. For each frame, we first estimate the ground surface together with a set of object detections. Based on these results, we then address object interactions and estimate trajectories. Finally, we employ the tracking results to predict future motion for dynamic objects and fuse this information with a static occupancy map estimated from dense stereo. The approach is experimentally evaluated on several long and challenging video sequences from busy inner-city locations recorded with different mobile setups. The results show that the proposed integration makes stable tracking and motion prediction possible, and thereby enables path planning in complex and highly dynamic scenes.

I. INTRODUCTION

For reliable autonomous navigation, a robot or car requires appropriate information about both its static and dynamic environment. While remarkable successes have been achieved in relatively clean highway traffic situations [3] and other largely pedestrian-free scenarios such as the DARPA Urban Challenge [6], highly dynamic situations in busy city centers still pose considerable challenges for state-of-the-art approaches.

For successful path planning in such scenarios where multiple independent motions and frequent partial occlusions abound, it is vital to extract semantic information about individual scene objects. Consider for example the scene depicted in the top left corner of Fig. 1. When just using depth information from stereo or LIDAR, an occupancy map would suggest little free space for driving (bottom left). However, as can be seen in the top right image (taken one second later), the pedestrians free up their occupied space soon after, which would thus allow a robotic platform to pass through without unnecessary and possibly expensive replanning. The difficulty is to correctly assess such situations in complex real-world settings, detect each individual scene object, predict its motion, and infer a dynamic obstacle map from the estimation results (bottom right). This task is made challenging by the extreme degree of clutter, appearance variability, abrupt motion changes, and the large number of independent actors in such scenarios.

Fig. 1.  A static occupancy map (bottom left) can erroneously suggest no free space for navigation, even though space is actually freed up a second later (top right). By using the semantic information from an appearance-based multi-person tracker, we can cast predictions about each tracked person’s future motion. The resulting dynamic obstacle map (bottom right) correctly shows sufficient free space, as the persons walk on along their paths.

In this paper, we propose a purely vision-based approach to address this task. Our proposed system uses as input the video streams from a synchronized, forward-looking camera pair. To analyze this data, the system combines visual object detection and tracking capabilities with continuous self-localization by visual odometry and with 3D mapping based on stereo depth. Its results can be used directly as additional input for existing path planning algorithms to support dynamic obstacles. Key steps of our approach are the use of a state-of-the-art object recognition approach for identifying an obstacle’s category, as well as the reliance on a robust multi-hypothesis tracking framework employing model selection to handle the complex data association problems that arise in crowded scenes. This allows our system to apply category-specific motion models for robust tracking and prediction.

In order to cope with the challenges of real-world operation, we additionally introduce numerous couplings and feedback paths between the different components of our system. Thus, we jointly estimate the ground surface and supporting object detections and let both steps benefit from each other. The resulting detections are transferred into world coordinates with the help of visual odometry and are grouped into candidate trajectories by the tracker. Successful tracks are then again fed back to stabilize visual odometry and depth computation through their motion predictions. Finally, the results are combined in a dynamic occupancy map such as the one shown in Fig. 1 (bottom right), which allows free space computation for a later navigation module.

The main contribution of this paper is to show that vision based sensing has progressed sufficiently for such a system to become realizable. Specifically, we focus on tracking-by-detection of pedestrians in busy inner-city scenes, as this is an especially difficult but very important application area of future robotic and automotive systems. Our focus on vision alone does not preclude the use of other sensors such as LIDAR or GPS/INS—in any practical robotic system those sensors have their well-deserved place, and their integration can be expected to further improve performance. However, the richness of visual input makes it possible to infer very detailed semantic information about the target scene, and the relatively low sensor weight and cost make vision attractive for many applications.

The paper is structured as follows: the upcoming section reviews previous work. Section III then gives an overview of the different components of our vision system with a focus on pedestrian tracking, before Section IV discusses its application to the generation of dynamic occupancy maps. Implementation details are given in Section V. Finally, we present experimental results on challenging urban scenarios in Section VI, before the paper is concluded in Section VII.

II. RELATED WORK

Obstacle avoidance is one of the central capabilities of any autonomous mobile system. Many systems are building up occupancy maps [7] for this purpose. An exhaustive review can be found in [28]. While such techniques are geared towards static obstacles, a main challenge is to accurately detect moving objects in the scene. Such objects can be extracted independent of their category by modeling the shape of the road surface and treating everything that does not fit that model as an object (e.g. in [19], [26], [33]). However, such simple approaches break down in crowded situations where not enough of the ground may be visible. More accurate detections can be obtained by applying category-specific models, either directly on the camera images [5], [16], [25], [31], on the 3D depth information [1] or both in combination [9], [12], [27].

Tracking detected objects over time presents additional challenges due to the complexity of data association in crowded scenes. Targets are typically followed using classic tracking approaches such as Extended Kalman Filters (EKF), where data assignment is optimized using Multi-Hypothesis Tracking (MHT) [4], [22] or Joint Probabilistic Data Association Filters (JPDAF) [11]. Several robust approaches have been proposed based on those components, either operating on depth measurements [23], [24], [29] or as tracking-by-detection approaches from purely visual input [13], [17], [31], [32]. The approach employed in this paper is based on our previous work [17]. It works online and simultaneously optimizes detection and trajectory estimation for multiple interacting objects over long time windows, by operating in a hypothesis selection framework.

III. SYSTEM

Our vision system is designed for a mobile platform equipped with a pair of forward-looking cameras. Altogether, we report experimental results for three different such platforms, shown in Fig. 2. In this paper, we only use visual appearance and stereo depth, and integrate different components for ground plane and ego-motion estimation, object detection, tracking, and occupied area prediction.

Fig. 2.  Mobile recording platforms used in our experiments. Note that in this paper we only employ image information from a stereo camera pair and do not make use of other sensors such as GPS or LIDAR.

Fig. 3(a) gives an overview of the proposed vision system. For each frame, the blocks are executed as follows. First, a depth map is calculated and the new frame’s camera pose is predicted. Then objects are detected together with the supporting ground surface, taking advantage of appearance, depth, and previous trajectories. The output of this stage, along with predictions from the tracker, helps stabilize visual odometry, which updates the pose estimate for the platform and the detections, before running the tracker on these updated detections. As a final step, we use the estimated trajectories in order to predict the future locations for dynamic objects and fuse this information with a static occupancy map. The whole system is held entirely causal, i.e. at any point in time it only uses information from the past and present.

Fig. 3.  (a) Flow diagram for our vision system. (b) Graphical model for tracking-by-detection with additional depth information (see text for details).

For the basic tracking-by-detection components, we rely on the framework described in [8]. The main contribution of this paper is to extend this framework to the prediction of future spatial occupancy for both static and dynamic objects. The following subsections describe the main system components and give details about their robust implementation.

A. Coupled Object Detection and Ground Plane Estimation

Instead of directly using the output of an object detector for the tracking stage, we introduce scene knowledge to reduce false positives. For this, we assume a simple scene model where all objects of interest reside on a common ground plane. As a wrong estimate of this ground plane has far-reaching consequences for all later stages, we try to avoid making hard decisions here and instead model the coupling between object detections and the scene geometry probabilistically using a Bayesian network (see Fig. 3(b)). This network is constructed for each frame and models the dependencies between object hypotheses o_i, object depths d_i, and the ground plane π using evidence from the image I, the depth map D, a stereo self-occlusion map O, and the ground plane evidence π_D in the depth map. Following standard graphical model notation, the plate indicates repetition of the contained parts for the number of objects n.

In this model, an object’s probability depends both on its geometric world position and size (expressed by P(o_i | π)), on its correspondence with the depth map P(o_i | d_i), and on P(I | o_i), the object likelihood estimated by the object detector. The likelihood P(π_D | π) of each candidate ground plane is modeled by a robust estimator taking into account the uncertainty of the inlier depth points. The prior P(π), as well as the conditional probability tables, are learned from a training set.

In addition, we introduce temporal dependencies, indicated by the dashed arrows in Fig. 3(b). For the ground plane, we propagate the state from the previous frame as a temporal prior P(π | π_{t−1}) = (1−α) P(π) + α P(π_{t−1}) that stabilizes the per-frame information from the depth map P(π_D | π). For the detections, we add a spatial prior for object locations that are supported by tracked candidate trajectories H_{t0:t−1}. As shown in Fig. 3(b), this dependency is not a first-order Markov chain, but reaches many frames into the past, as a consequence of the tracking framework explained in Section III-B.
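
To make the preceding two paragraphs concrete, the following sketch combines the temporal ground-plane prior with the depth evidence over a discretized set of candidate planes, and fuses the three per-object evidence terms. It is a minimal illustration under assumed likelihood arrays and an assumed mixing weight α; in the actual system the conditional probability tables are learned from training data.

```python
# Minimal sketch of the evidence fusion described above, assuming the ground
# plane is discretized into a fixed grid of candidate planes (6x6x20 = 720
# cells, as in Section V) and that the likelihood terms arrive as arrays.
import numpy as np

ALPHA = 0.5  # mixing weight of the temporal prior (illustrative value)

def ground_plane_belief(p_prior, p_prev, p_depth_given_plane):
    """P(pi | pi_{t-1}) = (1 - alpha) P(pi) + alpha P(pi_{t-1}),
    multiplied by the depth evidence P(pi_D | pi) and renormalized."""
    temporal_prior = (1.0 - ALPHA) * p_prior + ALPHA * p_prev
    belief = temporal_prior * p_depth_given_plane
    return belief / belief.sum()

def object_support(p_appearance, p_geometry, p_depth_agreement):
    """Per-object belief combining the detector term P(I | o_i), the
    ground-plane geometry term P(o_i | pi), and depth agreement P(o_i | d_i)."""
    return p_appearance * p_geometry * p_depth_agreement

# Toy example over the 720 candidate planes:
n = 6 * 6 * 20
rng = np.random.default_rng(0)
belief = ground_plane_belief(np.full(n, 1.0 / n), rng.dirichlet(np.ones(n)),
                             rng.random(n))
print(belief.argmax())  # index of the most likely candidate plane
```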

The advantage of this Bayesian network formulation is that it can operate in both directions. Given a largely empty scene where depth estimates are certain, the ground plane can significantly constrain object detection. In more crowded situations where less of the ground is visible, on the other hand, the object detector provides sufficient evidence to assist ground plane estimation.

B. Tracking and Prediction

After passing the Bayesian network, object detections are placed into a common world coordinate system using camera positions estimated from visual odometry. The actual tracking system follows a multi-hypothesis approach, similar to the one described in [17]. We do not rely on background modeling, but instead accumulate the detections of the current and past frames in a space-time volume. This volume is analyzed by growing many trajectory hypotheses using independent bi-directional Extended Kalman Filters (EKFs) with a holonomic constant-velocity model. While the inclusion of further motion models, as e.g. done in [27], would be possible, it proved to be unnecessary in our case.
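
As an illustration of the motion model, below is a minimal ground-plane filter with the holonomic constant-velocity state [x, y, vx, vy]. Since this model is linear, the EKF reduces to a standard Kalman filter here; all noise magnitudes are assumed values, not the authors' settings.

```python
# Constant-velocity Kalman filter sketch on the ground plane.
import numpy as np

class ConstVelocityKF:
    def __init__(self, x0, y0, dt=1.0 / 13.0):  # ~13 fps as in Section V
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.diag([0.5, 0.5, 2.0, 2.0])          # initial uncertainty
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(2, 4)                            # observe position only
        self.Q = np.diag([0.05, 0.05, 0.5, 0.5]) * dt    # process noise (assumed)
        self.R = np.eye(2) * 0.1                         # measurement noise (assumed)

    def predict(self):
        """Propagate state; returns predicted position and its covariance."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2], self.P[:2, :2]

    def update(self, z):
        """Fuse a detection's ground-plane position z = (x, y)."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
```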

By starting EKFs from detections at different time steps, an overcomplete set of trajectories is obtained, which is then pruned to a minimal consistent explanation using model selection. This step simultaneously resolves conflicts from overlapping trajectory hypotheses by letting trajectories compete for detections and space-time volume. In a nutshell, the pruning step employs quadratic pseudo-boolean optimization to pick the set of trajectories with maximal joint probability, given the observed evidence over the past frames. This probability:

• increases as the trajectories explain more detections and as they better fit the detections’ 3D location and 2D appearance through the individual contribution of each detection;

• decreases when trajectories are (partially) based on the same object detections, through pairwise corrections to the trajectories’ joint likelihoods (these express the constraints that each pedestrian can only follow one trajectory and that two pedestrians cannot be at the same location at the same time);

• decreases with the number of required trajectories, through a prior favoring explanations with fewer trajectories – balancing the complexity of the explanation against its goodness-of-fit in order to avoid over-fitting (“Occam’s razor”).

For the mathematical details, we refer to [17]. The most important features of this method are automatic track initialization (usually, after about 5 detections) and the ability to recover from temporary track loss and occlusion.
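
To give a flavor of this selection step, the toy sketch below scores each candidate trajectory by the three criteria listed above (detection support, overlap penalties, model cost) and picks tracks greedily. The real system solves the selection jointly via quadratic pseudo-boolean optimization [17]; the greedy loop is only a simplified stand-in.

```python
# Greedy stand-in for the QPBO-based trajectory selection described above.
def select_trajectories(hypotheses, overlap_penalty, model_cost=1.0):
    """hypotheses: list of (id, score, set_of_detection_ids).
    overlap_penalty: cost per detection shared with an already selected track."""
    selected, used = [], set()
    for hid, score, dets in sorted(hypotheses, key=lambda h: -h[1]):
        shared = len(dets & used)
        gain = score - model_cost - overlap_penalty * shared
        if gain > 0:               # Occam's razor: a track must pay for
            selected.append(hid)   # its own model complexity
            used |= dets
    return selected

tracks = [(0, 5.0, {1, 2, 3, 4, 5}), (1, 4.0, {4, 5, 6}), (2, 1.2, {2, 3})]
print(select_trajectories(tracks, overlap_penalty=2.0))  # -> [0]
```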

The selected trajectories H are then used to provide a spatial prior for object detection in the next frame. This prediction has to take place in the world coordinate system, so tracking critically depends on an accurate and smooth ego-motion estimate.

C. Visual Odometry

To allow reasoning about object trajectories in the world coordinate system, the camera position for each frame is estimated using visual odometry. The employed approach builds upon previous work by [8], [20]. In short, each incoming image is divided into a grid of 10×10 bins, and an approximately uniform number of points is detected in each bin using a Harris corner detector with locally adaptive thresholds. The binning encourages a feature distribution suitable for stable localization. To reduce outliers in RANSAC, we mask out corners that coincide with predicted object locations from the tracker output and are hence not deemed suitable for localization, as shown in Fig. 4.
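
A possible reading of the binned corner extraction with dynamic-object masking is sketched below. OpenCV's Harris-based detector stands in for the paper's corner detector with locally adaptive thresholds, and all parameter values are assumptions.

```python
# Sketch: detect a roughly uniform number of corners per bin, skipping
# pixels inside predicted object boxes (the non-static image parts).
import cv2
import numpy as np

def binned_corners(gray, tracked_boxes, grid=10, per_bin=10):
    """gray: uint8 image; tracked_boxes: integer (x0, y0, x1, y1) boxes."""
    h, w = gray.shape
    mask = np.full((h, w), 255, np.uint8)
    for (x0, y0, x1, y1) in tracked_boxes:        # mask out dynamic objects
        mask[max(0, y0):y1, max(0, x0):x1] = 0
    points = []
    bh, bw = h // grid, w // grid
    for by in range(grid):
        for bx in range(grid):
            sub = gray[by*bh:(by+1)*bh, bx*bw:(bx+1)*bw]
            smask = mask[by*bh:(by+1)*bh, bx*bw:(bx+1)*bw]
            pts = cv2.goodFeaturesToTrack(sub, per_bin, 0.01, 5,
                                          mask=smask, useHarrisDetector=True)
            if pts is not None:                   # shift back to image coords
                points.extend((p[0][0] + bx*bw, p[0][1] + by*bh) for p in pts)
    return points
```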

Fig. 4.   Visual odometry and occupancy maps are only based on image parts not explained by tracked objects, i.e. the parts we believe to be static. Left: original image with detected features. Right: image when features on moving objects (green) are ignored.

In the initial frame, stereo matching and triangulation provide a first estimate of the 3D structure. In subsequent frames, we use 3D-2D matching to get correspondences, followed by camera resection (3-point pose) with RANSAC. Old frames (t′ < t−15) are discarded, along with points that are only supported by those removed frames. To guarantee robust performance, we introduce an explicit failure detection mechanism based on the covariance of the estimated camera position, as described in [8]. In case of failure, a Kalman filter estimate is used instead of the measurement, and the visual odometry is restarted from scratch. This allows us to keep the object tracker running without resetting it. While such a procedure may introduce a small drift, a locally smooth trajectory is more important for our application. In fact, driftless global localization would require additional input from other sensors such as a GPS.
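
The failure test can be sketched as a simple threshold on the positional covariance; the bound below is an assumed number, and in the real system a Kalman filter bridges the gap until odometry is re-initialized [8].

```python
# Covariance-based odometry failure check (threshold is a placeholder).
import numpy as np

POS_COV_THRESHOLD = 0.05  # m^2, hypothetical bound on positional variance

def check_odometry(pos_measured, cov_measured, kf_prediction):
    """Return the camera position to use and whether a restart is required."""
    failed = np.trace(cov_measured[:3, :3]) > POS_COV_THRESHOLD
    return (kf_prediction, True) if failed else (pos_measured, False)
```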

IV. OCCUPANCY MAP AND FREE SPACE PREDICTION

For actual path planning, the construction of a reliable occupancy map is of utmost importance. We split this in two parts according to the static scene and the dynamically moving objects.

For static obstacles, we construct a stochastic occupancy map based on the algorithm from [2]. In short, incoming depth maps are projected onto a polar grid on the ground and are fused with the integrated and transformed map from the previous frames. Based on this, free space for driving can be computed using dynamic programming. While [2] integrate entire depth maps (including any dynamic objects) for the construction of the occupancy map, we opt to filter out these dynamic parts. As in the connection with visual odometry, we use the tracker prediction as well as the current frame’s detections to mask out any non-static parts. The reasons for this are twofold: first, integrating non-static objects can result in a smeared occupancy map. Second, we are not only interested in the current position of the dynamic parts, but also in their future locations. For this, we can use accurate and category-specific motion models inferred from the tracker.
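
The static map update on a polar ground grid might look as follows, applied only to depth points that survive the dynamic-object masking. Grid resolution and the blending weight are illustrative choices; [2] additionally extracts drivable free space from this grid with dynamic programming.

```python
# Sketch of the static occupancy update on a polar ground grid.
import numpy as np

N_ANGLE, N_RANGE, MAX_RANGE = 90, 60, 30.0   # 2 deg x 0.5 m cells (assumed)

def update_polar_map(prev_map, points_xz, decay=0.8):
    """points_xz: Nx2 static-scene points (x lateral, z forward) in meters."""
    hist = np.zeros((N_ANGLE, N_RANGE))
    ang = np.degrees(np.arctan2(points_xz[:, 0], points_xz[:, 1])) + 90.0
    rng = np.hypot(points_xz[:, 0], points_xz[:, 1])
    ok = (ang >= 0) & (ang < 180) & (rng < MAX_RANGE)
    a = (ang[ok] / 180.0 * N_ANGLE).astype(int)
    r = (rng[ok] / MAX_RANGE * N_RANGE).astype(int)
    np.add.at(hist, (a, r), 1.0)                 # accumulate depth evidence
    return decay * prev_map + (1.0 - decay) * (hist > 0)  # temporal fusion
```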

Dynamic Obstacles. As each object selected by the tracker is modeled by an independent EKF, we can predict its future position and obtain the corresponding uncertainty C. Choosing a bound on the positional uncertainty then yields an ellipse where the object will reside with a given probability. In our experiments, a value of 99% resulted in a good compromise between safety from collision and the need to leave a navigable path for the robot to follow. For the actual occupancy map, we also have to take into consideration the object’s dimensions and, in case of an anisotropic “footprint”, the bounds for its rotation. We assume pedestrians to have a circular footprint, so the final occupancy cone can be constructed by adding the respective radius to the uncertainty ellipse. In our visualization, we show the entire occupancy cone for the next second, i.e. the volume the pedestrian is likely to occupy within that time.
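
Concretely, the 99% region of a 2D Gaussian position prediction lies within Mahalanobis distance sqrt(chi2.ppf(0.99, 2)) ≈ 3.03 of the predicted mean. The sketch below grows that ellipse by an assumed circular pedestrian footprint radius to obtain one cross-section of the occupancy cone.

```python
# Predicted occupancy ellipse for one pedestrian at a given lookahead.
import numpy as np
from scipy.stats import chi2

FOOT_RADIUS = 0.3  # m, assumed circular pedestrian footprint

def occupancy_ellipse(pred_pos, pred_cov, p=0.99):
    """Return center, semi-axes, and orientation of the grown ellipse."""
    s = chi2.ppf(p, df=2)                        # ~9.21 for p = 0.99
    eigvals, eigvecs = np.linalg.eigh(pred_cov)  # principal axes of C
    axes = np.sqrt(s * eigvals) + FOOT_RADIUS    # grow by the footprint
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])
    return pred_pos, axes, angle
```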

Based on this predicted occupancy map, free space for driving can be computed with the same algorithm as in [2], but using an appropriate prediction horizon. Note that in case a person was not tracked successfully, it will still occur in the static occupancy map, as a sort of graceful degradation of the system.

V. DETAILED IMPLEMENTATION

The system’s parameters were trained on a sequence with 490 frames, containing 1,578 annotated pedestrian bounding boxes. In all experiments, we used data recorded at a resolution of 640×480 pixels (Bayer pattern) at 13–14 fps, with a camera baseline of 0.4 and 0.6 meters for the child stroller and car setups, respectively.

Ground Plane. For training, we infer the ground plane directly from D using Least-Median-of-Squares (LMedS), with bad estimates discarded manually. Related but less general methods include e.g. the v-disparity analysis [15]. For tractability, the ground plane parameters (θ, φ, π_4) are discretized into a 6×6×20 grid, with bounds inferred from the training sequences. The training sequences also serve to construct the prior distribution P(π).

Object Hypotheses. Our system is independent of a specific detector choice. In the experiments presented here, we use a publicly available detector based on a Histogram-of-Oriented-Gradients representation [5]. The detector is run with a low confidence threshold to retain the necessary flexibility—in the context of the additional evidence we are using, final decisions based only on appearance would be premature. The range of detected scales corresponds to pedestrian heights of 60–400 pixels. The object size distribution is modeled as a Gaussian N(1.7, 0.085²) [m], as in [14]. The depth distribution is assumed uniform in the system’s operating range of 0.5–30 [m] (60 [m] for the car setup).
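
As a worked example of the size prior: a detection 170 pixels tall at 6.5 m depth with an assumed focal length of 640 pixels back-projects to about 1.73 m, which scores highly under N(1.7, 0.085²). The flat pinhole back-projection below is a simplification of the full ground-plane geometry used in the system.

```python
# Size-prior likelihood for a detection, under a simplified pinhole model.
import numpy as np

def height_likelihood(box_h_px, depth_m, focal_px):
    height_m = box_h_px * depth_m / focal_px     # pinhole back-projection
    mu, sigma = 1.7, 0.085                       # size prior from [14]
    z = (height_m - mu) / sigma
    return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2 * np.pi))

print(height_likelihood(box_h_px=170, depth_m=6.5, focal_px=640))
```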

Depth Cues. The depth map D for each frame is obtained with a publicly available, belief-propagation-based disparity estimation software [10]. All results reported in this paper are based on this algorithm. In the meantime, we have also experimented with a fast GPU-based depth estimator, which seems to achieve similar system-level accuracy. However, we still have to verify those results in practice. For verifying detections by depth measurements in the Bayesian network, we consider the agreement of the measured mean depth inside the detection bounding box with the ground-plane distance to the bounding box foot-point. As the detector’s bounding box placement is not always accurate, we allow the Bayesian network to “wiggle around” the bounding boxes slightly in order to improve goodness of fit. The final classifier for an object’s presence is based on the number of inlier depth points and is learned from training data using logistic regression.
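
A sketch of this verification term: count the depth pixels inside the box that agree with the ground-plane distance at the foot point, and squash the inlier fraction through a logistic function. The tolerance and the logistic coefficients below are placeholders for what the system learns by logistic regression.

```python
# Depth-agreement support for one detection (all constants are placeholders).
import numpy as np

W, B = 0.15, -3.0  # hypothetical logistic-regression parameters

def depth_support(depth_roi, plane_dist, tol=0.5):
    """depth_roi: depth pixels inside the (possibly shifted) bounding box;
    plane_dist: ground-plane distance at the box foot-point, in meters."""
    inliers = np.abs(depth_roi - plane_dist) < tol   # agreeing depth pixels
    frac = inliers.mean() * 100.0                    # inlier percentage
    return 1.0 / (1.0 + np.exp(-(W * frac + B)))     # P(object present)
```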

Belief Propagation. The network of Fig. 3 is constructed for each frame, with all variables modeled as discrete entities and their conditional probability tables defined as described above. Inference is conducted using Pearl’s Belief Propagation [21]. For efficiency reasons, the set of possible ground planes is pruned to the 20% most promising ones (according to prior and depth information).

VI. RESULTS

In order to evaluate our vision system, we applied it to three test sequences, showing strolls and drives through busy pedestrian zones. The sequences were acquired with the platforms seen in Fig. 2. The first test sequence (“Seq. #1”), recorded with platform (a), shows a walk over a crowded square, extending over 230 frames. The second sequence (“Seq. #2”), recorded with platform (b) at considerably worse image contrast, contains 5,193 pedestrian annotations in 999 frames. The third test sequence (“Seq. #3”) consists of 800 frames and was recorded from a car passing through a crowded city center, where it had to stop a few times to let people pass. We annotated pedestrians in every fourth frame, resulting in 960 annotations for this sequence.

For a quantitative evaluation, we measure bounding box overlap in each frame and plot recall over false positives per image for three stages of our system. The results of this experiment are shown in Fig. 5 (left, middle). The plot compares the raw detector output, the intermediate output of the Bayesian network, and the final tracking output. As can be seen, the Bayesian network’s rejection of detections that do not agree with the scene greatly reduces false positives with hardly any impact on recall. The tracking stage additionally improves the results and in most cases achieves a higher performance than the raw detector. It should be noted, though, that a single-frame comparison is not entirely fair here, since the tracker requires some detections to initialize (losing recall) and reports tracking results through occlusions (losing precision if the occluded persons are not annotated). However, the tracking stage provides the necessary temporal information that makes the entire motion prediction system possible at all. The blue curves in Fig. 5 show the performance on all annotated pedestrians. When only considering the immediate range up to 15 m distance (which is suitable for a speed of 30 km/h in inner-city scenarios), performance is considerably better, as indicated by the red curves.
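
The per-frame matching underlying these curves can be sketched as follows; the 0.5 intersection-over-union threshold is the conventional choice and an assumption here.

```python
# Per-frame bounding-box matching for recall / false-positives-per-image.
def iou(a, b):
    """Boxes as (x0, y0, x1, y1)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter)
    return inter / union if union else 0.0

def frame_stats(dets, annos, thr=0.5):
    """Greedy matching: each annotation may explain at most one detection."""
    matched, tp = set(), 0
    for d in dets:
        j = max(range(len(annos)), key=lambda k: iou(d, annos[k]), default=None)
        if j is not None and j not in matched and iou(d, annos[j]) >= thr:
            matched.add(j); tp += 1
    return tp, len(dets) - tp, len(annos)   # TP, FP, #annotations
```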

To assess the suitability of our system for path planning, we investigate the precision of the motion prediction for increasing time horizons. This precision is especially interesting, since it allows us to quantify the possible advantage over systems that model only static obstacles. Specifically, we compare the bounding boxes obtained from the tracker’s prediction with the actual annotations in the frame and count the fraction of false positives (1 − precision). The results can be seen in Fig. 5 (right). As expected, precision drops with increasing lookahead time, but stays within acceptable limits for a prediction horizon ≤ 1 s (12 frames). Note that this plot should only be taken qualitatively: a precision of 0.9 does not imply an erroneous replanning every 10th frame, as many of the predicted locations do not affect the planned path. Rather, this experiment shows that for reasonable prediction horizons, the precision does not drop considerably.

Example tracking results for Seq. #1 are shown in Fig. 6. The operating point for generating those results was the same as the one used in Fig. 5(right). Recorded on a busy city square, many people interact in this scene, moving in all directions, stopping abruptly (e.g. the first orange box), and frequently occluding each other (see e.g. the second orange box). The bounding boxes are color coded to show the tracked identities (due to the limited palette, some color labels repeat). Below each image, we show the inferred dynamic obstacle map in an overhead view. Static obstacles are marked in black; each tracked pedestrian is entered with its current position and the predicted occupancy cone for the next second (for standing pedestrians, this cone reduces to a circle). As can be seen, our system is able to track most of the visible pedestrians correctly and to accurately predict their future motion.

Fig. 7 shows more results for Seq. #2. Note that both adults and children are identified and tracked correctly even though they differ considerably in their appearance. In the bottom row of the figure, a man in pink walks diagonally towards the camera. Without motion prediction, a following navigation module might issue an unnecessary stop here. However, our system correctly determines that he presents no danger of collision and resolves this situation. Also note how the standing woman in the white coat gets integrated into the static occupancy map as soon as she is too large to be detected. This is a safe fallback in the design of our system—when no detections are available, its results simply revert to those of a depth-integration based occupancy map.

Finally, Fig. 8 demonstrates the vision system in a car application. Compared to the previous sequences, the viewpoint is quite different, and faster scene changes result in fewer data points for creating trajectories. Still, stable tracking performance can be obtained also for quite distant pedestrians.

System Performance

Apart from the object detectors, the entire system is implemented in an integrated fashion in C/C++, with several procedures taking advantage of GPU processing. For the complex parts of Seq. #3 (15 simultaneous objects), we can achieve processing times of around 400 ms per frame on an Intel Core2 CPU 6700, 2.66GHz, NVidia GeForce 8800 (see Tab. I). While the detector stage is the current bottleneck (the detector was run offline and needed about 30 seconds per image), we want to point out that for the HOG detector, real-time GPU implementations exist [30], which could be substituted to remove this restriction.

VII. CONCLUSION

In this paper, we have presented a mobile vision system for the creation of dynamic obstacle maps for automotive or mobile robotics platforms. Such maps should provide valuable input for actual path planning algorithms [18]. Our approach relies on a robust tracking system that closely integrates different modules (appearance-based object detection, depth estimation, tracking, and visual odometry). To resolve the complex interactions that occur between pedestrians in urban scenarios, a multi-hypothesis tracking approach is employed. The inferred predictions can then be used to extend a static occupancy map generation system to a dynamic one, which then allows for more detailed path planning. The resulting system can handle very challenging scenes and delivers accurate predictions for many simultaneously tracked objects.

In future work, we plan to optimize the individual system components further with respect to run-time and performance. As discussed before, system operation at 2-3 fps is already reachable now, but additional improvements are necessary for true real-time performance. In addition, we plan to improve the trajectory analysis by including more elaborate motion models and to combine it with other sensing modalities such as GPS and LIDAR.

REFERENCES

[1] K. O. Arras, O. M. Mozos, and W. Burgard. Using boosted features for the detection of people in 2D range data. In ICRA, 2007.

[2] H. Badino, U. Franke, and R. Mester. Free space computation using stochastic occupancy grids and dynamic programming. In ICCV Workshop on Dynamical Vision (WDV), 2007.

[3] M. Betke, E. Haritaoglu, and L. S. Davis. Real-time multiple vehicle tracking from a moving vehicle. MVA, 12(2):69–83, 2000.

[4] I. J. Cox. A review of statistical data association techniques for motion correspondence. IJCV, 10(1):53–66, 1993.

[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[6] DARPA. DARPA Urban Challenge rulebook, 2008. http://www.darpa.mil/GRANDCHALLENGE/docs/Urban_Challenge_Rules_102707.pdf.

[7] A. Elfes. Sonar-based real-world mapping and navigation. IEEE Journal of Robotics and Automation, 3(3):249–265, 1987.

[8] A. Ess, B. Leibe, K. Schindler, and L. van Gool. A mobile vision system for robust multi-person tracking. In CVPR, 2008.

[9] A. Ess, B. Leibe, and L. van Gool. Depth and appearance for mobile scene analysis. In ICCV, 2007.

[10] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. IJCV, 70:41–54, 2006.

[11] T. E. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE J. Oceanic Engineering, 8(3):173–184, 1983.

[12] D. Gavrila and V. Philomin. Real-time object detection for “smart” vehicles. In ICCV, pages 87–93, 1999.

[13] D. M. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. IJCV, 73:41–59, 2007.

[14] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.

[15] R. Labayrade, D. Aubert, and J.-P. Tarel. Real time obstacle detection on non flat road geometry through ‘v-disparity’ representation. In IVS, 2002.

[16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1-3):259–289, May 2008.

[17] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled detection and tracking from static cameras and moving vehicles. IEEE TPAMI, 30(10):1683–1698, 2008.

[18] K. Macek, A. D. Vasquez, T. Fraichard, and R. Siegwart. Safe vehicle navigation in dynamic urban scenarios. In ITSC, 2008.

[19] S. Nedevschi, R. Danescu, D. Frentiu, T. Graf, and R. Schmidt. High accuracy stereovision approach for obstacle detection on non-planar roads. In Proc. IEEE Intelligent Engineering Systems, 2004.

[20] D. Nistér, O. Naroditsky, and J. R. Bergen. Visual odometry. In CVPR, 2004.

[21] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers Inc., 1988.

[22] D. B. Reid. An algorithm for tracking multiple targets. IEEE T. Automatic Control, 24(6):843–854, 1979.

[23] M. Scheutz, J. McRaven, and G. Cserey. Fast, reliable, adaptive, bimodal people tracking for indoor environments. In IROS, 2004.

[24] D. Schulz, W. Burgard, D. Fox, and A. Cremers. People tracking with mobile robots using sample-based joint probabilistic data association filters. IJRR, 22(2):99–116, 2003.

[25] A. Shashua, Y. Gdalyahu, and G. Hayun. Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. In IVS, 2004.

[26] M. Soga, T. Kato, M. Ohta, and Y. Ninomiya. Pedestrian detection with stereo vision. In IEEE International Conf. on Data Engineering, 2005.

[27] L. Spinello, R. Triebel, and R. Siegwart. Multimodal people detection and tracking in crowded scenes. In Proc. of the AAAI Conference on Artificial Intelligence (Physically Grounded AI Track), July 2008.

[28] S. Thrun. Probabilistic Robotics. The MIT Press, 2005.

[29] C.-C. Wang, C. Thorpe, and S. Thrun. Online simultaneous localization and mapping with detection and tracking of moving objects: Theory and results from a ground vehicle in crowded urban areas. In ICRA, 2003.

[30] C. Wojek, G. Dorkó, A. Schulz, and B. Schiele. Sliding-windows for rapid object class localization: A parallel technique. In DAGM, 2008.

[31] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet part detectors. IJCV, 75(2):247–266, 2007.

[32] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.

[33] L. Zhao and C. Thorpe. Stereo- and neural network-based pedestrian detection. In ITS, 2000.
