0x00 Outline
- paper: "UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning" https://arxiv.org/abs/1709.06841
- code: https://github.com/drmaj/UnDeepVO
- reference: UnDeepVO: a monocular visual odometry based on unsupervised deep learning (Chinese write-up) https://cloud.tencent.com/developer/news/210831
-
The authors propose a novel monocular VO system, called UnDeepVO. Training uses stereo image pairs (with consecutive frames also paired up), while testing uses only monocular images, hence the "Monocular" in the title.
-
Two points stand out:
- unsupervised
- recovers absolute scale
(the scale recovery is not fully understood yet)
-
loss
Dense supervision in both space and time: synthesized views between the left and right frames provide the spatial loss, and synthesized views between consecutive frames provide the temporal loss.
Spatial geometric consistency means the reprojection constraints between corresponding points of the left-right image pair; temporal geometric consistency means the reprojection constraints between corresponding points across the monocular image sequence.
-
[16-17] use a photometric-consistency loss to recover depth; [18] uses the photometric consistency of a monocular image sequence to recover pose and depth. This paper's novelty is incorporating
the 3D geometric registration loss
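As a concrete sketch of this supervision signal: every loss term above reduces to a photometric error between a real view and a synthesized one, and the terms are combined with weights. The function names and weights below are illustrative assumptions, not from the paper (NumPy is used throughout these notes for clarity):

```python
import numpy as np

def photometric_l1(real, synthesized):
    """Mean absolute photometric error between a real view and a view
    synthesized from the other camera (spatial) or the other frame (temporal)."""
    return float(np.mean(np.abs(real - synthesized)))

def total_loss(spatial_errors, temporal_errors, w_spatial=1.0, w_temporal=1.0):
    """Weighted sum of the spatial and temporal terms; the weights here are
    illustrative hyper-parameters, not values from the paper."""
    return w_spatial * sum(spatial_errors) + w_temporal * sum(temporal_errors)

# Toy example: a synthesized view that is uniformly 0.5 off gives error 0.5.
left = np.zeros((2, 2))
left_synth = np.ones((2, 2)) * 0.5
err = photometric_l1(left, left_synth)
```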
-
Q: Why are r and t trained separately?
A: Rotation (in its Euler-angle representation) is highly nonlinear and therefore harder to train than translation. To train it better without supervision, the network decouples the translation and rotation parameters using two separate sets of fully-connected layers after the last convolutional layer. This makes it possible to introduce a normalizing weight between rotation and translation, giving better predictions.
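The "normalizing weight" in the answer can be sketched as separate translation and rotation terms with an up-weight on rotation; `lambda_rot` is an assumed illustrative value, not the paper's:

```python
import numpy as np

def pose_loss(t_pred, t_true, r_pred, r_true, lambda_rot=10.0):
    """Decoupled pose loss: translation and rotation (Euler angles) come from
    separate fully-connected branches, so each gets its own L2 term; the
    rotation term is up-weighted to compensate for its stronger nonlinearity.
    lambda_rot is an illustrative hyper-parameter, not from the paper."""
    return (np.linalg.norm(t_pred - t_true)
            + lambda_rot * np.linalg.norm(r_pred - r_true))

t = np.array([1.0, 0.0, 0.0])
r = np.zeros(3)
perfect = pose_loss(t, t, r, r)  # identical predictions -> zero loss
```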
Q: What is the difference between a spatial transformer and bilinear interpolation?
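A sketch of (my understanding of) the answer: bilinear interpolation is only the sampling kernel, while a spatial transformer [15] additionally generates the sampling grid from a (learned) transform and keeps the whole warp differentiable, so gradients flow back to the pose/depth networks. The kernel itself:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinear interpolation at continuous coordinates (x, y) in a 2-D image.
    This is only the sampling kernel; a spatial transformer also *generates*
    the (x, y) sampling grid from a transform, end-to-end differentiably."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x1]
            + (1 - dx) * dy * img[y1, x0] + dx * dy * img[y1, x1])

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
center = bilinear_sample(img, 0.5, 0.5)  # average of the four corners -> 1.5
```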
0x01 Related Work & References to Follow Up
- Feature-based methods (geometric methods):
[1] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “MonoSLAM: Real-time single camera SLAM,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
[2] G. Klein and D. Murray, “Parallel tracking and mapping for small AR workspaces,” in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on. IEEE, 2007, pp. 225–234.
[3] R. Mur-Artal, J. Montiel, and J. D. Tardos, “ORB-SLAM: a versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
- Direct methods (geometric methods):
[4] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2320–2327.
[5] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 834–849.
[6] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- Learning-based methods
[7] is the pioneering work; all of the following are supervised:
-
[7] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A convolutional network for real-time 6-DOF camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2938–2946.
-
[8] R. Li, Q. Liu, J. Gui, D. Gu, and H. Hu, “Indoor relocalization in challenging environments with dual-stream convolutional neural networks,” IEEE Transactions on Automation Science and Engineering, 2017.
*[9] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, “VidLoc: 6-DoF video-clip relocalization,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-
[10] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, “Exploring representation learning with CNNs for frame-to-frame ego-motion estimation,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18–25, 2016.
-
[11] S. Wang, R. Clark, H. Wen, and N. Trigoni, “DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2043–2050.
-
[12] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “DeMoN: Depth and motion network for learning monocular stereo,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-
[13] R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, “VINet: Visual-Inertial odometry as a sequence-to-sequence learning problem,” in AAAI, 2017, pp. 3995–4001.
-
[14] S. Pillai and J. J. Leonard, “Towards visual ego-motion learning in robots,” arXiv preprint arXiv:1705.10279, 2017.
Unsupervised work has mainly focused on depth estimation; since the introduction of the spatial transformer [15] image-warping technique, only [18] has touched on VO.
-
[15] M. Jaderberg, K. Simonyan, A. Zisserman, et al. , “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
-
[16] R. Garg, G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 740–756.
-
[17] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-
[18] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
0x02 The Networks
- Two networks, one estimating depth and the other estimating pose; the pose network is VGG-based
- The spatial losses come from the left and right images of a stereo pair;
PS: Specifically, for the overlapped area between the two stereo images, every pixel in one image can find its correspondence in the other at horizontal distance D_p.
-
photometric consistency loss between the left and right frames
-
disparity consistency loss
-
pose consistency loss
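The spatial photometric term can be sketched as warping the right image into the left view using the per-pixel disparity D_p; integer disparities and this left/right convention are simplifying assumptions (a real implementation samples sub-pixel values bilinearly):

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Synthesize the left view: the pixel at column u in the left image
    corresponds to column u - D_p in the right image (for this convention).
    Integer disparity is assumed for simplicity; out-of-bounds pixels stay 0."""
    h, w = right.shape
    out = np.zeros_like(right)
    for v in range(h):
        for u in range(w):
            u_src = u - int(round(disparity[v, u]))
            if 0 <= u_src < w:
                out[v, u] = right[v, u_src]
    return out

right = np.arange(12, dtype=float).reshape(3, 4)
disp = np.ones((3, 4))                      # constant 1-pixel disparity
left_synth = warp_right_to_left(right, disp)
# The photometric loss would compare left_synth with the real left image.
```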
- Depth map and pose are combined into a rigid-body transform to express the temporal loss.
-
photometric consistency loss between consecutive frames: images are synthesized using the estimated pose and a spatial transformer network
Q: What is the difference between the pose here and the synthesized rigid flow?
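The temporal synthesis can be sketched as the standard back-project / transform / re-project chain that produces the spatial transformer's sampling grid; the intrinsics K and pose T below are illustrative values, not from the paper:

```python
import numpy as np

def project_to_next_frame(u, v, depth, K, T):
    """Map a pixel (u, v) with estimated depth in frame k to its location in
    frame k+1, using the estimated relative pose T (4x4) and intrinsics K.
    The resulting coordinates form the spatial transformer's sampling grid."""
    p = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # back-project to 3D
    p_next = T[:3, :3] @ p + T[:3, 3]                       # rigid transform
    uvw = K @ p_next                                        # re-project
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1                                # small sideways camera motion
u2, v2 = project_to_next_frame(64.0, 64.0, 5.0, K, T)
```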
-
3D Geometric Registration Loss
Related to point clouds; not fully understood.
A point P_k in frame k can be mapped through the estimated transform T_{k,k+1} to a point in frame k+1, written P_k'; likewise, a point in frame k+1 can be mapped back to frame k.
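A minimal sketch of this registration term, assuming the two point clouds are already in pixel-wise correspondence (the helper name and the L1 choice are mine, not the paper's):

```python
import numpy as np

def registration_loss(points_k, points_k1, T_k_to_k1):
    """3D geometric registration loss (sketch): transform the point cloud of
    frame k into frame k+1 with the estimated pose, then compare it point by
    point with the cloud reconstructed from frame k+1's own depth map."""
    R, t = T_k_to_k1[:3, :3], T_k_to_k1[:3, 3]
    transformed = points_k @ R.T + t
    return float(np.mean(np.abs(transformed - points_k1)))

pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 2.0, 3.0]])
T = np.eye(4)
T[:3, 3] = [0.2, 0.0, 0.0]                   # pure x-translation
# A perfectly consistent cloud in frame k+1 gives zero loss.
loss = registration_loss(pts, pts + np.array([0.2, 0.0, 0.0]), T)
```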
0x03 The Authors' Experiments
- A very interesting video: https://www.youtube.com/watch?v=5RdjO93wJqo&t
- Two frames are fed in at a time to estimate pose; input resolution is 416×128
- Comparison: SfMLearner [18], monocular VISO2-M, and ORB-SLAM-M (without loop closure). None of the methods use loop-closure detection.
- Quantitative evaluation: average translational root-mean-square error (RMSE) drift (%) and average rotational RMSE drift (°/100 m), over trajectory lengths of 100 m-800 m
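For intuition, a simplified (end-point-only) sketch of the translational drift metric; the actual KITTI benchmark averages errors over all sub-sequences of 100 m-800 m, which this does not reproduce:

```python
import numpy as np

def translational_drift_percent(gt_xyz, est_xyz):
    """Simplified drift sketch: end-point position error divided by total
    ground-truth trajectory length, as a percentage. The real benchmark
    averages this over many sub-sequence lengths (100 m ... 800 m)."""
    length = np.sum(np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1))
    err = np.linalg.norm(gt_xyz[-1] - est_xyz[-1])
    return 100.0 * err / length

gt = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])   # 100 m straight line
est = np.array([[0.0, 0.0, 0.0], [99.0, 0.0, 0.0]])   # ends 1 m short
drift = translational_drift_percent(gt, est)           # 1 m / 100 m = 1 %
```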
- Qualitative and quantitative depth evaluation; depth maps with absolute scale are obtained from the stereo pairs during training
0x04 Personal Summary
-
On top of pose and depth, a new pose constraint is added, achieving real-scale monocular visual odometry in an unsupervised way.
"New pose constraint", my foot: UnDeepVO essentially builds on Tinghui Zhou's 2017 work [18], replaces the monocular input with stereo, adds the 2017 left-right consistency loss [17], and fuses the two.
2017 Tinghui Zhou [18]: consecutive frames as input, producing pose and depth
2017 left-right consistency [17]
-
The network's inputs and outputs, and the newly added pose loss, are not yet thoroughly understood.
Being able to explain Fig. 3 in your own words means you have really understood it.
-
How absolute scale is recovered during training is also not thoroughly understood (+1)