



Our input encoding uses a 17-point neighborhood to compute normals for the entire scene via the well-accepted plane-fitting approach [18]. For each fragment, we anchor 2048 sample points distributed with spatial uniformity. These sample points act as keypoints, and the points within a 30 cm vicinity of each keypoint form the patch from which we compute the local PPF encoding. We also down-sample the points within each patch to 1024, both to facilitate training and to make the features more robust to varying point density and missing parts. For the occasional patch with too few points in the defined neighborhood, we randomly repeat points to ensure an identical patch size. PPFNet extracts compact descriptors of dimension 64.
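The fixed patch size described above can be sketched as follows. This is only an illustration: the text does not say how the down-sampling is done, so uniform random sampling is an assumption, and the helper name `resample_patch` is hypothetical.

```python
import random

def resample_patch(points, size=1024, rng=None):
    """Return exactly `size` points drawn from `points`.

    Large patches are down-sampled without replacement (assumed: uniformly
    at random). Small patches keep every original point and are padded with
    randomly repeated points, as the text describes.
    """
    if not points:
        raise ValueError("a patch needs at least one point")
    rng = rng or random.Random()
    if len(points) >= size:
        return rng.sample(points, size)
    padding = [rng.choice(points) for _ in range(size - len(points))]
    return list(points) + padding
```

Padding by repetition keeps every patch the same shape, so all patches of a fragment can be stacked into one tensor for the network.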
PPFNet is implemented in TensorFlow [1]. The weights are initialized randomly and the ADAM optimizer [25] minimizes the loss. Our network operates simultaneously on all 2048 patches. The learning rate is set to 0.001 and decayed exponentially every 10 epochs until it reaches 0.00001. Due to hardware constraints, we use a batch size of 2 fragment pairs per iteration, which already contains 8192 local patches from 4 fragments. This generates 2×2048² combinations for the network per batch.
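As a rough illustration of the schedule and the batch bookkeeping, here is a minimal sketch in plain Python. The decay factor is not given in the text, so `decay_rate=0.5` is purely an assumed value, and `learning_rate` is a hypothetical helper rather than the authors' code.

```python
def learning_rate(epoch, base_lr=1e-3, min_lr=1e-5, decay_rate=0.5, step=10):
    """Staircase exponential decay, clamped at `min_lr`.

    The rate is multiplied by `decay_rate` once every `step` epochs.
    `decay_rate` is an assumption; the text only gives the start value,
    the end value and the 10-epoch interval.
    """
    return max(min_lr, base_lr * decay_rate ** (epoch // step))

# Batch bookkeeping taken from the text.
patches_per_fragment = 2048
fragment_pairs = 2
patches_per_batch = 2 * fragment_pairs * patches_per_fragment        # 4 fragments
combinations_per_batch = fragment_pairs * patches_per_fragment ** 2  # 2 x 2048^2
```

With these numbers a batch holds 8192 patches and 8,388,608 patch combinations, which matches the figures quoted above.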

(2)Real Datasets

We concentrate on real datasets rather than synthetic ones, and therefore evaluate on the diverse 3DMatch RGB-D benchmark [48], in which 62 different real-world scenes are retrieved from a pool of datasets: Analysis-by-Synthesis [42], 7-Scenes [38], SUN3D [46], RGB-D Scenes v.2 [27] and Halber et al. [15]. This collection is split into 2 subsets, 54 scenes for training and validation and 8 for testing. The dataset typically includes indoor scenes such as living rooms, offices, bedrooms, tabletops, and restrooms. See [48] for details. As our input consists of point geometry only, we use solely the fragment reconstructions captured by the Kinect sensor and not the color.

(3)Can PPFNet outperform the baselines on real data?

We evaluate our method against the hand-crafted baselines Spin Images [21], SHOT [37], FPFH [34] and USC [41], as well as 3DMatch [48], the state-of-the-art deep-learning-based 3D local feature descriptor, the vanilla PointNet [30], and CGF [23], a hybrid hand-crafted and deep descriptor designed for compactness. To make the comparison fairer, we also show a version of 3DMatch that uses 2048 local patches per fragment instead of 5K, the same as in our method, denoted 3DMatch-2K. We use the provided pretrained weights of CGF [23]. We keep the local patch size the same for all methods. Our evaluation data consists of fragments from the 7-Scenes [38] and SUN3D [46] datasets. We begin by showing comparisons without applying RANSAC to prune false matches. We believe that this shows the true quality of the correspondence estimator. Inspired by [23], we regard recall as a more effective measure for this experiment, as precision can always be improved by better correspondence pruning [5, 4]. Our evaluation metric directly computes the recall by averaging the number of matched fragments across the datasets:
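The equation itself is missing from this copy. Judging from the symbol definitions that follow, it presumably takes the form below. This is a reconstruction, and the subscript s, used to index the fragment pairs, is an added assumption.

```latex
R = \frac{1}{M} \sum_{s=1}^{M} \mathbb{1}\!\left( \left[ \frac{1}{|\Omega_s|} \sum_{(i,j) \in \Omega_s} \mathbb{1}\big( \lVert x_i - T_s\, y_j \rVert < \tau_1 \big) \right] > \tau_2 \right)
```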
where M is the number of ground-truth matching fragment pairs, i.e. pairs with at least 30% overlap under the ground-truth transformation T, and τ1 = 10 cm is the distance threshold for a correspondence to count as an inlier. (i, j) denotes an element of the found correspondence set Ω; x and y come from the first and second fragment under matching, respectively. The inlier ratio is set to τ2 = 0.05. As seen in Tab. 1, PPFNet outperforms all the hand-crafted counterparts in mean recall. It also shows a consistent advantage over 3DMatch-2K, which uses an equal number of patches. Finally, and remarkably, we are able to show ∼2.7% improvement in mean recall over the original 3DMatch while using only ∼40% of the keypoints for matching. The performance boost from 3DMatch-2K to 3DMatch also indicates that having more keypoints is advantageous for matching.
Table 1. Our evaluation on the 3DMatch benchmark before RANSAC.
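The recall metric described above can be sketched in a few lines. This is an illustrative reading of the definitions, not the benchmark's evaluation code: the function names are hypothetical, and representing the ground-truth transformation as a Python callable is an assumption made for brevity.

```python
import math

def inlier_ratio(correspondences, transform, tau1=0.10):
    """Fraction of correspondences (x, y) with ||x - T(y)|| < tau1 (metres)."""
    if not correspondences:
        return 0.0
    inliers = sum(
        1 for x, y in correspondences if math.dist(x, transform(y)) < tau1
    )
    return inliers / len(correspondences)

def fragment_recall(pairs, tau1=0.10, tau2=0.05):
    """Share of ground-truth fragment pairs whose inlier ratio exceeds tau2.

    `pairs` holds one (correspondences, transform) tuple per ground-truth
    overlapping fragment pair, so len(pairs) plays the role of M.
    """
    if not pairs:
        raise ValueError("need at least one ground-truth fragment pair")
    matched = sum(
        1 for corr, transform in pairs
        if inlier_ratio(corr, transform, tau1) > tau2
    )
    return matched / len(pairs)
```

A fragment pair counts as matched as soon as slightly more than 5% of its putative correspondences are correct, which is why the metric reflects the raw quality of the descriptor before any pruning.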

As expected, our method outperforms both vanilla PointNet and CGF by 15%. We show in Tab. 2 that adding more samples brings benefit, but only up to a certain level (< 5K). For PPFNet, adding more samples also increases the global context; thus, as hardware advances, we have the potential to further widen the performance gap over 3DMatch simply by using more local patches. To show that our gains do not depend on a cherry-picked τ2, we also plot the recall computed with the same metric for different inlier ratios in Fig. 6(a). There, for the practical choices of τ2, PPFNet consistently remains above all others.

(4)Application to geometric registration

Similar to [48], we now use PPFNet in the broader context of transformation estimation. To do so, we plug all descriptors into the well-established RANSAC-based matching pipeline, in which the transformation between fragments is estimated by running a maximum of 50,000 RANSAC iterations on the initial correspondence set. We then transform the source cloud to the target using the estimated 3D pose and compute the point-to-point error, a well-established error metric [48]. Tab. 3 tabulates the results on the real datasets. Overall, PPFNet is again the top performer, showing higher recall on a majority of the scenes and on average. It is noteworthy that we always use 2048 patches while allowing 3DMatch to use its original setting of 5K. Even so, we obtain better recall on more than half of the scenes. When we feed 3DMatch 2048 patches, to be on par with our sampling level, PPFNet dominates on most scenes, with higher average accuracy.
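The final scoring step, applying the estimated pose and measuring the point-to-point error, might look like the sketch below. It does not cover the RANSAC search itself. Pairing source and target points by list index and reporting a root-mean-square error are assumptions made for illustration; the exact protocol is the one defined in [48].

```python
import math

def apply_pose(rotation, translation, point):
    """Apply p' = R p + t, with R a 3x3 nested list and t a 3-vector."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def point_to_point_rmse(source, target, rotation, translation):
    """RMSE between transformed source points and index-paired target points."""
    if len(source) != len(target) or not source:
        raise ValueError("need equally sized, non-empty point lists")
    squared = [
        math.dist(apply_pose(rotation, translation, s), t) ** 2
        for s, t in zip(source, target)
    ]
    return math.sqrt(sum(squared) / len(squared))
```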

(5)Robustness to point density

Changes in point density, a.k.a. sparsity, are an important concern for point clouds, as density can vary with sensor resolution or with distance for 3D scanners. This motivates us to evaluate our algorithm against others at varying sparsity levels. We gradually decrease the point density of the evaluation data and record the accuracy. Fig. 6(b) shows the significant advantage of PPFNet, especially under severe loss of density (only 6.5% of points kept). Such robustness is achieved thanks to the PointNet backend and the robust point pair features.
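A density sweep of this kind can be simulated with a few lines of Python. The text does not say how points were removed, so uniform random thinning is an assumption here, and `thin_point_cloud` is a hypothetical helper.

```python
import random

def thin_point_cloud(points, keep_ratio, rng=None):
    """Keep a uniformly random subset holding `keep_ratio` of the points."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not points:
        return []
    rng = rng or random.Random()
    # Always keep at least one point so downstream code never sees an empty cloud.
    keep = max(1, round(len(points) * keep_ratio))
    return rng.sample(points, keep)
```

Calling it with `keep_ratio=0.065` reproduces the most severe setting mentioned above, where only 6.5% of the points survive.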

(6)How fast is PPFNet?

We found PPFNet to be lightning fast in inference and very quick in data preparation, since we consume a very raw representation of the data. The majority of our runtime is spent on normal computation, which is done only once for the whole fragment. The PPF extraction is carried out within the neighborhoods of only 2048 sample points. Tab. 4 shows the average running times of different methods and ours on an NVIDIA TitanX Pascal GPU supported by an Intel Core i7 3.2 GHz 8-core CPU. Such a dramatic speed-up in inference is enabled by the parallel PointNet backend and our simultaneous correspondence estimation for all patches during inference. Currently, we use only the CPU to prepare the input for the network, leaving the GPU idle for more work. This part can easily be implemented on the GPU to gain further speed-ups.

5.1. Ablation Study

(1)N-tuple loss

We train and test our network with 3 different losses: contrastive (pair) [14], triplet [17] and our N-tuple loss, on the same dataset with an identical network configuration. The inter-distance distributions of correspondent and non-correspondent pairs are recorded for the training and validation data respectively. Empirical results in Fig. 7 show that the theoretical advantage of our loss immediately transfers to practice: features learned with the N-tuple loss are better separable, i.e. non-pairs are more distant in the embedding space and pairs enjoy a lower standard deviation. The N-tuple loss repels non-pairs further than the contrastive and triplet losses do because of its better knowledge of global correspondence relationships. Our N-tuple loss is general, and we therefore strongly encourage its application to other domains as well, such as pose estimation [44].
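The separability statistics discussed above (mean distance and standard deviation for pairs versus non-pairs) could be gathered along these lines. This is a sketch under assumptions: descriptors are plain tuples, distances are Euclidean, and the ground-truth correspondences arrive as a boolean matrix `is_pair`; none of these details are spelled out in the text.

```python
import math
import statistics

def distance_stats(desc_a, desc_b, is_pair):
    """Mean and population std of descriptor distances for pairs and non-pairs.

    `is_pair[i][j]` is True when desc_a[i] and desc_b[j] correspond.
    Returns ((pair_mean, pair_std), (non_pair_mean, non_pair_std)).
    Both groups must be non-empty, otherwise statistics.mean raises.
    """
    pair, non_pair = [], []
    for i, a in enumerate(desc_a):
        for j, b in enumerate(desc_b):
            bucket = pair if is_pair[i][j] else non_pair
            bucket.append(math.dist(a, b))

    def summarize(distances):
        return statistics.mean(distances), statistics.pstdev(distances)

    return summarize(pair), summarize(non_pair)
```

A well-separated embedding shows a small pair mean with low spread and a clearly larger non-pair mean.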

(2)How useful is global context for local feature extraction?

We argue that local features depend on the context. A corner belonging to a dining table should not share similar local features with a picture frame hanging on the wall, since a table is generally not supposed to be attached vertically to the wall. To assess the returns obtained from adding global context, we simply remove the global feature concatenation, keep the rest of the settings unaltered, and re-train and test on two subsets of fragment pairs. Our results are shown in Tab. 5, where injecting global information into local features improves the matching by 18% on the training set and 7% on the validation set compared to our baseline version of vanilla PointNet, which is free of global context and PPFs. Such a significant gain indicates that global features aid discrimination and are valid cues for local descriptors as well.
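To make "global feature concatenation" concrete, here is a minimal sketch of the idea. It assumes the global vector is an element-wise max over all local patch features, in the spirit of PointNet-style pooling. That aggregation choice and the helper name `add_global_context` are assumptions, not details taken from this section.

```python
def add_global_context(local_features):
    """Append an element-wise max-pooled global vector to every local feature.

    `local_features` is a list of equal-length feature lists, one per patch.
    """
    if not local_features:
        return []
    # Max over patches, independently for each feature dimension.
    global_feature = [max(column) for column in zip(*local_features)]
    return [list(feature) + global_feature for feature in local_features]
```

The ablation described above corresponds to skipping this step and feeding the local features on their own.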

(3)What does adding PPF bring?

We now run a similar experiment and train two versions of our network, with and without incorporating PPF into the input. The contribution is tabulated in Tab. 5. There, a gain of 1% in training and 5% in validation is achieved, showing that the inclusion of PPF increases the discriminative power of the final features. While this is a significant jump, it is not the only benefit of adding PPF. Note that our input representation is composed of 33% rotation-invariant and 66% rotation-variant representations. This is already an advantage over the state of the art, where rotation handling is left entirely to the network to learn from data. We hypothesize that the input guidance of PPF helps the network become more tolerant to rigid transformations. To test this, we gradually rotate fragments around the z-axis up to 180° in steps of 30° and then match each rotated fragment to the non-rotated one.
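The rotation sweep can be reproduced with a small helper such as the one below. It is a sketch only: `rotate_z` and `ROTATION_SWEEP` are hypothetical names, and including 0° as the unrotated reference is an assumption.

```python
import math

def rotate_z(points, degrees):
    """Rotate 3D points about the z-axis by `degrees` (counter-clockwise)."""
    angle = math.radians(degrees)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

# Angles from the text: 30-degree steps up to 180 degrees.
ROTATION_SWEEP = list(range(0, 181, 30))
```

Each rotated copy, `rotate_z(fragment, angle)` for `angle` in `ROTATION_SWEEP`, would then be matched against the original fragment.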
As we can observe from Tab. 6, with PPFs the feature is more robust to rotation, and the gap in matching performance between the two networks widens as the rotation increases. Accordingly, we also show a visualization of the descriptors in Fig. 9 under small and large rotations. To assign each descriptor an RGB color, we use a PCA projection from the high-dimensional feature space to the 3D color space by learning a linear map [23]. It is qualitatively apparent that PPF strengthens the robustness to rotations. All in all, with PPFs we gain both accuracy and robustness to rigid transformation, the best of seemingly contradicting worlds. It is noteworthy that using only PPF introduces full invariance in addition to the invariance to permutations, and renders the task very difficult to learn for our current network. We leave this as a future challenge. A major limitation of PPFNet is its quadratic memory footprint, which limits the number of patches to 2K on our hardware. This is, for instance, why we cannot outperform 3DMatch on fragments of Home-2. With upcoming GPUs, we expect to reach beyond 5K, the point of saturation.
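As a rough worked example of what the quadratic footprint implies, assume the dominant memory term grows with the square of the number of patches N and read "5K" as 5000 patches (both are assumptions). Moving from 2048 to 5000 patches would then cost roughly six times the memory:

```latex
\frac{N_{5\mathrm{K}}^{2}}{N_{2\mathrm{K}}^{2}}
  = \left( \frac{5000}{2048} \right)^{2}
  \approx 2.44^{2}
  \approx 5.96
```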

6. Conclusion

We have presented PPFNet, a new 3D descriptor tailored for point cloud input. By generalizing the contrastive loss to the N-tuple loss to fully utilize the available correspondence relationships, and by retargeting the training pipeline, we have shown how to learn a globally aware 3D descriptor that outperforms the state of the art not only in recall but also in speed. Features learned by PPFNet are more capable of dealing with some challenging scenarios, as shown in Fig. 8. Furthermore, we have shown that designing our network to suit set inputs such as point pair features is advantageous in developing invariance properties. Future work will target the memory bottleneck and the more general rigid graph matching problem.

