Autonomous vehicles and robots need 3D information such as depth and pose to traverse paths safely and correctly. Classical methods rely on hand-crafted features that can fail in challenging scenarios, such as those with low texture [1]. Although neural networks can be trained on monocular data to predict 3D structure in a supervised manner [2], depth annotation is expensive and hard to obtain at scale. In contrast, recent methods co-train depth and pose estimation networks in a self-supervised manner, using view synthesis between adjacent video frames as the training signal [3]. Self-supervised depth and pose estimation has been further improved by adding losses and constraints from classical vision [4] or by adopting newer network architectures [5, 6].
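The core self-supervised signal in such methods is photometric consistency: predicted depth and relative pose are used to warp a source frame into the target view, and the reconstruction error supervises both networks. A minimal NumPy sketch of this warping loss (illustrative only; the function and variable names are ours, and real pipelines use differentiable bilinear sampling rather than the nearest-neighbor lookup used here):

```python
import numpy as np

def view_synthesis_loss(target, source, depth, K, T):
    """Photometric loss: warp `source` into the target view using a
    predicted depth map (H, W), camera intrinsics K (3, 3), and a
    relative pose T (4, 4), then compare against `target`."""
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
    # Back-project to 3D camera points: X = D * K^{-1} p.
    cam = np.linalg.inv(K) @ pix * depth.ravel()
    # Rigid transform into the source camera frame.
    cam_h = np.vstack([cam, np.ones((1, H * W))])
    cam_src = (T @ cam_h)[:3]
    # Project back to pixels with the intrinsics (divide by depth).
    proj = K @ cam_src
    px = np.round(proj[0] / proj[2]).astype(int)
    py = np.round(proj[1] / proj[2]).astype(int)
    # Keep only pixels that land inside the source image.
    valid = (px >= 0) & (px < W) & (py >= 0) & (py < H)
    warped = np.zeros(H * W)
    warped[valid] = source[py[valid], px[valid]]
    # Mean L1 photometric error over valid pixels.
    return np.abs(warped[valid] - target.ravel()[valid]).mean()
```

With an identity pose and identical frames the loss is zero; during training, the gradient of this error with respect to the predicted depth and pose is what drives both networks, without any ground-truth labels.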
However, to fully exploit the potential of self-supervised learning for 3D vision, networks must be able to work on a diverse set of incoming data, sourced from different regions with distinct road structures, in different weather conditions, etc. [1, 7]. Standard neural network training requires access to these diverse scenes from the beginning, which is not feasible for long-term deployment of models. When new, unseen environments are encountered, the networks often fail to generalize due to domain shift. Directly continuing training in the new environment also leads to 'catastrophic forgetting' of previously learned information, i.e., as the network learns to predict depth and pose in the new environment, its performance on the previously observed environment(s) drops.
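Catastrophic forgetting can be illustrated with a toy sequential-training experiment (our own sketch, unrelated to the cited depth networks): fit a linear regressor to "environment A", continue training it only on a conflicting "environment B", and measure how the error on A degrades.

```python
import numpy as np

def train(w, X, y, lr=0.1, steps=200):
    """Plain gradient descent on mean-squared error."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
# Environment A and a conflicting environment B (opposite mappings).
X_a = rng.normal(size=(100, 5))
y_a = X_a @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_b = rng.normal(size=(100, 5))
y_b = X_b @ np.array([-5.0, -4.0, -3.0, -2.0, -1.0])

w = train(np.zeros(5), X_a, y_a)   # learn environment A
loss_a_before = mse(w, X_a, y_a)   # near zero after training on A
w = train(w, X_b, y_b)             # continue training on B only
loss_a_after = mse(w, X_a, y_a)    # error on A grows sharply
```

After the second stage the model fits B well, but its error on A increases by orders of magnitude: mirroring, in miniature, how a depth network fine-tuned on a new domain loses accuracy on the old one.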
[1] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8977–8986, 2019.
[2] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. NeWCRFs: Neural window fully-connected CRFs for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022.
[3] Hemang Chawla, Arnav Varma, Elahe Arani, and Bahram Zonooz. Multimodal scale consistency and awareness for monocular self-supervised depth estimation. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
[4] Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, and Adrien Gaidon. Multi-frame self-supervised depth with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 160–170, 2022.
[5] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2485–2494, 2020.
[6] Arnav Varma, Hemang Chawla, Bahram Zonooz, and Elahe Arani. Transformers in self-supervised monocular depth estimation with unknown camera intrinsics. In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 4: VISAPP, pages 758–769. INSTICC, SciTePress, 2022.
[7] Hemang Chawla, Matti Jukola, Terence Brouns, Elahe Arani, and Bahram Zonooz. Crowdsourced 3d mapping: A combined multi-view geometry and self-supervised learning approach. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4750–4757. IEEE, 2020.
[8] Prashant Bhat, Bahram Zonooz, and Elahe Arani. Consistency is the key to further mitigating catastrophic forgetting in continual learning. In Conference on Lifelong Learning Agents, 2022.