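At the time of writing, this paper ranks first on the KITTI leaderboard. It comes from The Chinese University of Hong Kong and SenseTime, and the same author also proposed PointRCNN and Part-A$^2$ Net.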

PV-RCNN

This paper combines the strengths of grid-based methods (I usually call them voxel-based) and point-based methods. It begins by stating the trade-offs of the two families:
“Generally, the grid-based methods are more computationally efficient but the inevitable information loss degrades the fine-grained localization accuracy, while the point-based methods have higher computation cost but could easily achieve larger receptive field by the point set abstraction.”

The overall network architecture is shown in the paper's framework figure (not reproduced here).

RPN

Backbone: 3D Sparse Convolution

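The paper does not say much about the backbone, but the author's earlier paper, “Part-A$^2$ Net: 3D Part-Aware and Aggregation Neural Network for Object”, describes it in more detail. Since it is a backbone, it is fairly general-purpose anyway. So why does the author use 3D Sparse Convolution? The paper's answer: “Because of its high efficiency and accuracy”.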

Classification & Regression Head

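The 3D feature map is converted to a bird's-eye-view (BEV) map by folding the height dimension into the channel dimension. Each BEV cell then gets two anchors per class, oriented at 0 and 90 degrees.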
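To make the BEV conversion and the anchor layout concrete, here is a minimal NumPy sketch. The function names, the `(C, D, H, W)` axis order, the cell size and the car-sized anchor dimensions are all my own illustrative assumptions, not values taken from the paper or its code.

```python
import numpy as np

def to_bev(feat_3d):
    """Fold the height axis into channels: (C, D, H, W) -> (C * D, H, W)."""
    c, d, h, w = feat_3d.shape
    return feat_3d.reshape(c * d, h, w)

def make_anchors(h, w, cell_size, sizes_per_class, z_center=0.0):
    """Build anchors of shape (H, W, num_classes, 2, 7).

    Each anchor is (x, y, z, length, width, height, yaw); every BEV cell gets
    two anchors per class, one at yaw 0 and one at yaw pi/2.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Cell centres in metres (hypothetical origin at the map corner).
    centers = np.stack([(xs + 0.5) * cell_size, (ys + 0.5) * cell_size], axis=-1)
    sizes = np.asarray(sizes_per_class, dtype=np.float64)  # (num_classes, 3)
    anchors = np.zeros((h, w, sizes.shape[0], 2, 7))
    anchors[..., 0:2] = centers[:, :, None, None, :]
    anchors[..., 2] = z_center
    anchors[..., 3:6] = sizes[None, None, :, None, :]
    anchors[..., 6] = np.array([0.0, np.pi / 2])
    return anchors

# Toy sizes: 64 channels, 2 height slices, a 4 x 5 BEV map, one class.
bev = to_bev(np.zeros((64, 2, 4, 5)))
anchors = make_anchors(4, 5, cell_size=0.4, sizes_per_class=[(3.9, 1.6, 1.56)])
print(bev.shape, anchors.shape)
```

Note that the anchor set contains no 180 or -90 degree orientation, which is exactly what prompts the question about opposite-facing cars.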

The experiments show that this combination of backbone and anchor scheme gives high recall: “As shown in Table 4, the adopted 3D voxel CNN backbone with anchor-based scheme achieves higher recall performance than the PointNet-based approaches [25, 37]”

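One question remains, though: the anchors are at 0 or 90 degrees, so how is -90 degrees handled? In other words, how are vehicles facing the opposite direction handled? Is the heading taken into account when proposals are generated? If it is, how is an opposite-facing car regressed? In that case this anchor setup looks unreasonable. If it is not, how is the ordering of the cells determined when the 6x6x6 grid is built from a proposal? Is heading simply never considered? This can only be settled by reading the loss definition or the code.
Part-A$^2$ Net does not go into detail on this either, but its citation traces back to SECOND: Sparsely Embedded Convolutional Detection. Seen that way it makes sense: it is analogous to image processing, where a network has to learn robustness to left-right flips.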
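For reference, SECOND regresses the angle with a sine-error loss and adds a separate direction classifier, which is one way a 0/90-degree anchor set can still cover opposite headings. The sketch below shows why the sine error cannot tell a heading from its 180-degree flip. The function names and the exact direction-bin convention are my own assumptions for illustration and should be checked against the actual code.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Element-wise smooth L1 (Huber-style) penalty."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def sine_angle_loss(theta_pred, theta_gt):
    """Penalise sin(pred - gt); a heading and its 180-degree flip cost the same."""
    return smooth_l1(np.sin(np.asarray(theta_pred) - np.asarray(theta_gt)))

def direction_label(theta_gt):
    """Assumed convention: 1 if the wrapped heading lies in [0, pi), else 0."""
    wrapped = np.mod(np.asarray(theta_gt, dtype=np.float64), 2 * np.pi)
    return (wrapped < np.pi).astype(np.int64)

theta = 0.3
flipped = theta + np.pi
print(sine_angle_loss(theta, flipped))                # about 0: the flip is not penalised
print(direction_label(theta), direction_label(flipped))  # the classifier resolves the flip
```

Under this scheme the box regression only needs to get the axis of the car right, and the binary direction output decides which end is the front.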

Voxel Set Abstraction Module (VSA)

Discussion

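Once we have proposals, we need to extract the features inside each proposal and turn them into a fixed-size feature map. The paper divides each proposal into a 6x6x6 grid. So how is the feature of each of the 6x6x6 cells computed?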

The author then discusses where existing approaches fall short:
“(i) These feature volumes are generally of low spatial resolution as they are downsampled by up to 8 times, which hinders accurate localization of objects in the input scene.
(ii) Even if one can upsample to obtain feature volumes/maps of larger spatial sizes, they are generally still quite sparse.”
In other words, interpolation-based pooling, in the spirit of RoI Align in image object detection, does not work well here.

The author then considers one idea: take the SA layer from PointNet++ and, for each cell, use it to aggregate the features from the backbone's output feature maps that fall within some range of that cell. The author notes, however, that this is far too expensive:
“A naive solution of using the set abstraction operation for pooling the scene feature voxels would be directly aggregating the multi-scale feature volume in a scene to the RoI grids. However, this intuitive strategy simply occupies much memory and is inefficient to be used in practice. For instance, a common scene from the KITTI dataset might result in 18,000 voxels in the 4× downsampled feature volumes. If one uses 100 box proposal for each scene and each box proposal has 3 × 3 × 3 grids. The 2,700 × 18,000 pairwise distances and feature aggregations cannot be efficiently computed, even after distance thresholding.”
To solve this problem, the author proposes the VSA Module, which reduces the total number of features that have to be aggregated, i.e. the 18,000 in the example above.
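The arithmetic in that quote is easy to reproduce. The sketch below counts brute-force pairwise distances for a single feature level, first for the naive voxel-to-grid scheme and then for a two-stage scheme that goes through n keypoints. The function name is mine, and n = 2048 is an assumed keypoint count, not a number given in these notes.

```python
def pair_counts(num_proposals, grid_side, num_voxels, num_keypoints):
    """Return (naive, two_stage) brute-force pair counts for one feature level."""
    num_grid_points = num_proposals * grid_side ** 3
    naive = num_grid_points * num_voxels
    # Two stages: voxels -> keypoints (once per scene), then keypoints -> grid points.
    two_stage = num_keypoints * num_voxels + num_grid_points * num_keypoints
    return naive, two_stage

# The example from the quote: 100 proposals, 3 x 3 x 3 grid, 18,000 voxels.
print(pair_counts(100, 3, 18_000, 2_048))  # (48600000, 42393600)
# The 6 x 6 x 6 grid that the paper actually uses for RoI pooling.
print(pair_counts(100, 6, 18_000, 2_048))  # (388800000, 81100800)
```

The voxel-to-keypoint stage is paid once per scene and does not depend on the number of proposals or the grid resolution, so the gap widens as the grid gets finer. This is only a pair count; it ignores distance thresholding and the multiple feature levels.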

VSA Module

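The VSA Module is already illustrated very clearly in the paper's figure. The procedure is as follows (Eqs. 1, 2, 3):
1) Use Furthest Point Sampling to select n points from the raw point cloud.
2) At each level of the backbone's feature maps, use an SA module to aggregate the features within a neighborhood of each point.
3) Concatenate all of these features.
The Extended VSA Module adds two more kinds of features:

Features computed by 2D bilinear interpolation on the BEV map obtained from the backbone's output feature map
Features computed from the raw point cloud with an SA module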
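The three steps above can be sketched in NumPy as follows. This is a simplified illustration under my own assumptions: the function names are hypothetical, the neighborhood is a plain ball of fixed radius, and the per-point MLP that a real SA layer applies before max pooling is omitted.

```python
import numpy as np

def farthest_point_sampling(points, n, start=0):
    """Greedily pick n indices from points (N, 3), each farthest from those chosen so far."""
    chosen = np.zeros(n, dtype=np.int64)
    chosen[0] = start
    min_dist = np.full(points.shape[0], np.inf)
    for i in range(1, n):
        diff = points - points[chosen[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(min_dist))
    return chosen

def set_abstraction(keypoints, src_xyz, src_feat, radius):
    """Max-pool source features within `radius` of each keypoint.

    Returns (n, C); keypoints with no neighbour keep an all-zero feature.
    """
    out = np.zeros((keypoints.shape[0], src_feat.shape[1]))
    for k, center in enumerate(keypoints):
        mask = np.linalg.norm(src_xyz - center, axis=1) <= radius
        if mask.any():
            out[k] = src_feat[mask].max(axis=0)
    return out

def voxel_set_abstraction(keypoints, levels, radii):
    """Pool every backbone level around the keypoints and concatenate the results.

    levels: list of (voxel_xyz, voxel_feat) pairs, one per backbone level.
    """
    pooled = [set_abstraction(keypoints, xyz, feat, r)
              for (xyz, feat), r in zip(levels, radii)]
    return np.concatenate(pooled, axis=1)

rng = np.random.default_rng(0)
cloud = rng.uniform(-10.0, 10.0, size=(500, 3))
keypoints = cloud[farthest_point_sampling(cloud, 16)]
# Two fake backbone levels with 8 and 16 channels.
levels = [(rng.uniform(-10, 10, (200, 3)), rng.normal(size=(200, 8))),
          (rng.uniform(-10, 10, (50, 3)), rng.normal(size=(50, 16)))]
kp_feat = voxel_set_abstraction(keypoints, levels, radii=[2.0, 4.0])
print(kp_feat.shape)  # (16, 24)
```

The two extra feature sources of the Extended VSA Module would simply be two more blocks appended to the same concatenation.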

PKW Module（Predicted Keypoint Weighting）

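The issue is that among the n points, some are foreground points and matter more, while others are background points and matter less, so the two need to be told apart. From the features of the n points, n weights are predicted; the weights are trained under supervision from the ground-truth foreground mask, and each point's feature is then multiplied by its weight to give the final feature of that point (Eq. 5). This process is called the PKW module.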
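A minimal sketch of the reweighting step is given below. A single linear layer followed by a sigmoid stands in for whatever small network the paper uses, and the foreground labels are computed against axis-aligned boxes for brevity; both are simplifying assumptions of mine, as are the function names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predicted_keypoint_weighting(feat, w, b):
    """Scale each keypoint feature by a predicted foreground weight in (0, 1).

    feat: (n, C) keypoint features; w: (C,) and b: scalar are learnable parameters.
    Returns the reweighted features (n, C) and the weights (n,).
    """
    weights = sigmoid(feat @ w + b)
    return feat * weights[:, None], weights

def foreground_labels(keypoints, boxes_min, boxes_max):
    """Supervision target: 1.0 if a keypoint lies inside any box, else 0.0.

    Simplification: boxes are axis-aligned (min corner, max corner); real
    ground-truth boxes also carry a yaw angle.
    """
    inside = ((keypoints[:, None, :] >= boxes_min[None]) &
              (keypoints[:, None, :] <= boxes_max[None]))
    return inside.all(axis=2).any(axis=1).astype(np.float64)

feat = np.ones((3, 2))
weighted, weights = predicted_keypoint_weighting(feat, w=np.zeros(2), b=0.0)
labels = foreground_labels(np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]]),
                           np.array([[-1.0, -1.0, -1.0]]),
                           np.array([[1.0, 1.0, 1.0]]))
print(weights, labels)
```

During training the predicted weights would be pushed towards these labels with a binary classification loss, so that background keypoints contribute less to the later pooling.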

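The process above represents the whole scene with n points. The paper calls this voxel-to-keypoint scene encoding, and the n points are called keypoints.

At this stage we have the proposals, plus the coordinates and corresponding features of the n keypoints.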

RCNN

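Given a proposal, we can generate 6x6x6 cells. For the center of each cell, we select the keypoints (from the n obtained earlier) that fall within its neighborhood and aggregate their features with an SA module to obtain the feature of that cell (Eqs. 6, 7).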
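Here is a hedged sketch of how the cell centers of a rotated proposal can be generated and how keypoint features can be pooled onto them. The box parameterisation `(cx, cy, cz, l, w, h, yaw)`, the function names and the radius are my own assumptions, and the learned MLP of the SA module is again left out.

```python
import numpy as np

def roi_grid_points(box, grid_size=6):
    """Cell centres of a grid_size^3 grid inside a rotated box, in world coordinates.

    box = (cx, cy, cz, l, w, h, yaw), with yaw a rotation about the z axis.
    """
    cx, cy, cz, l, w, h, yaw = box
    ticks = (np.arange(grid_size) + 0.5) / grid_size - 0.5  # cell centres in (-0.5, 0.5)
    gx, gy, gz = np.meshgrid(ticks, ticks, ticks, indexing="ij")
    local = np.stack([gx.ravel() * l, gy.ravel() * w, gz.ravel() * h], axis=1)
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot.T + np.array([cx, cy, cz])

def roi_grid_pool(grid_xyz, kp_xyz, kp_feat, radius):
    """Max-pool keypoint features within `radius` of each grid point; zeros if none."""
    out = np.zeros((grid_xyz.shape[0], kp_feat.shape[1]))
    for g, center in enumerate(grid_xyz):
        mask = np.linalg.norm(kp_xyz - center, axis=1) <= radius
        if mask.any():
            out[g] = kp_feat[mask].max(axis=0)
    return out

box = (1.0, 2.0, 0.5, 4.0, 2.0, 1.5, np.pi / 2)
grid = roi_grid_points(box)
rng = np.random.default_rng(0)
kp_xyz = rng.uniform(-3.0, 5.0, size=(256, 3))
kp_feat = rng.normal(size=(256, 32))
roi_feat = roi_grid_pool(grid, kp_xyz, kp_feat, radius=0.8)
print(grid.shape, roi_feat.shape)  # (216, 3) (216, 32)
```

Because the grid is laid out in the box's local frame before rotation, the order of the 216 cells follows the predicted heading, which is why the heading question raised earlier matters for this stage.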

With a fixed-size feature for each proposal, we can do confidence prediction and box refinement. Note that the ground-truth target for confidence prediction is derived from the IoU.
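As I recall it, the paper maps the IoU between a proposal and its ground-truth box to a soft target with a clipped linear function. The sketch below uses the constants I remember (scale 2, offset 0.5); treat them as an assumption to verify against the paper.

```python
def confidence_target(iou):
    """Soft confidence label from IoU: 0 below 0.25, 1 above 0.75, linear in between."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))

for iou in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(iou, confidence_target(iou))
```

The point of such a target is that the predicted confidence reflects localization quality rather than only class membership.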

Experiments

The method performs very well on both KITTI and the Waymo Open Dataset.

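Ablation studies:

Validates the effectiveness of voxel-to-keypoint scene encoding, comparing it against the RPN alone and against the naive approach.
Validates the contribution of the different features used in the VSA module.
Validates the effectiveness of the PKW module.
Validates that the RoI-grid pooling module is more effective than the RoI-aware pooling module (the method from Part-A$^2$ Net).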

Thoughts

The author's earlier papers include PointRCNN and Part-A$^2$ Net. This paper draws heavily on STD, so here is a comparison that traces how these networks evolved.

PointRCNN uses PointNet++ as its feature extraction module throughout, both for the backbone in the RPN and for feature extraction in the RCNN stage.

Compared with PointRCNN, STD adds the RoI-grid part. Because the RCNN stage uses a voxel representation, feature extraction in the RCNN stage also becomes 3D Convolution.

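Compared with STD, Part-A$^2$ Net uses a voxel representation from the very start and also replaces the RPN backbone with 3D Convolution. (It also proposes the part location representation, among other things.) Setting aside the details of the feature representation, I think Part-A$^2$ Net is essentially the naive approach described in this paper.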

PV-RCNN addresses the low computational efficiency of that Part-A$^2$ Net-style approach, which is the problem this paper sets out.

The paper is a very pleasant read, as if it were telling a story, and its way of writing is well worth borrowing.

[Paper Reading] [3D Object Detection] PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
