3D Packing for Self-Supervised Monocular Depth Estimation
by Vitor Guizilini et al., arXiv, 2020
- Depth estimator fD : I → D
- Ego-motion estimator fx : (It, IS) → x(t→S)
They predict inverse depth and use the PackNet architecture.
Inverse depth is probably a more stable regression target: points far from the camera have small inverse depth, so low precision there costs little, while nearby points, which carry more photometric information, get more of the representable range.
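A tiny numpy sketch (my illustration, not code from the paper) of the point above: the same absolute error in inverse-depth space becomes a tiny metric error up close and a large one far away.

```python
import numpy as np

# Illustration (mine, not from the paper): regressing inverse depth
# concentrates precision on nearby points. A fixed absolute error in
# inverse-depth space maps to a small metric error for near points and
# a large one for distant points -- acceptable, since distant pixels
# carry little photometric information anyway.
def depth_from_inverse(inv_depth, eps=1e-6):
    """Convert predicted inverse depth back to metric depth."""
    return 1.0 / np.maximum(inv_depth, eps)

depths = np.array([1.0, 10.0, 100.0])        # metres
inv = 1.0 / depths                           # network-style targets
perturbed = depth_from_inverse(inv + 1e-3)   # same small error everywhere
err = np.abs(perturbed - depths)             # metric error grows with distance
```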
They probably assume the scene is rigid and there are no moving objects, which is likely to cause errors on moving objects. How do they deal with moving objects?
Ego motion estimator
They use a rather simple CNN from SfMLearner.
The loss function consists of three parts:
1. appearance loss
2. depth smoothness loss
3. velocity scaling loss
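A minimal numpy sketch of the three terms (function names and the L1-only appearance term are my simplifications; the paper's appearance loss also mixes in SSIM):

```python
import numpy as np

# Hedged sketch of the three loss terms, not the paper's implementation.
def appearance_loss(target, warped):
    """Simplified photometric loss: L1 only here; the paper also uses SSIM."""
    return np.mean(np.abs(target - warped))

def smoothness_loss(inv_depth, image):
    """Edge-aware smoothness: penalise inverse-depth gradients,
    downweighted where the image itself has strong gradients (edges)."""
    d_dx = np.abs(np.diff(inv_depth, axis=1))
    d_dy = np.abs(np.diff(inv_depth, axis=0))
    i_dx = np.mean(np.abs(np.diff(image, axis=1)), axis=-1)
    i_dy = np.mean(np.abs(np.diff(image, axis=0)), axis=-1)
    return np.mean(d_dx * np.exp(-i_dx)) + np.mean(d_dy * np.exp(-i_dy))

def velocity_loss(pred_translation, speed, dt):
    """Anchor metric scale: the norm of the predicted translation between
    frames should match instantaneous speed times the frame interval."""
    return np.abs(np.linalg.norm(pred_translation) - speed * dt)
```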
paper: Unsupervised Learning of Depth and Ego-Motion from Video (SfMLearner), by Berkeley and Google
One of the previous works that became a foundation for 3D Packing.
Implementation in PyTorch: SfmLearner-Pytorch
Depth from Videos in the Wild
Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras by Ariel Gordon et al, 2019.
Here they learn not only depth and R/t but also the intrinsics of the camera.
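A hedged sketch of the idea (my parameterisation, not Gordon et al.'s code): a small network head emits a few unconstrained numbers, and the intrinsics matrix is assembled from them with positivity constraints on the focal lengths.

```python
import numpy as np

# Sketch only: one plausible way to let a network predict pinhole
# intrinsics. The head outputs 4 unconstrained numbers; softplus keeps
# the focal lengths positive, and everything is scaled by image size.
def softplus(x):
    return np.log1p(np.exp(x))

def intrinsics_from_head(raw, width, height):
    """raw: 4 unconstrained head outputs -> 3x3 intrinsics matrix K."""
    fx = softplus(raw[0]) * width
    fy = softplus(raw[1]) * height
    cx = (raw[2] + 0.5) * width    # principal point near the image centre
    cy = (raw[3] + 0.5) * height
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

K = intrinsics_from_head(np.zeros(4), width=640, height=192)
```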
Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos
Paper by Google, based on vid2depth.
vid2depth is another paper by Google, itself based on SfMLearner.
monodepth2 - Digging Into Self-Supervised Monocular Depth Estimation
A method that 3D Packing uses as a competitor.
RealMonoDepth: Self-Supervised Monocular Depth Estimation for General Scenes
Self-supervised, trained from both stereo and mono.
They claim to be better than monodepth2 and more generalized than "Depth from Videos in the Wild". However, they require the camera calibration and a median depth estimate to be obtained beforehand with an external tool (COLMAP). Not really "in the wild".
They were able to train on multiple scenes with different depth ranges. The method still requires static scenes for training; for example, they use Mannequin Challenge data to train their models. The network can then be used on dynamic scenes with people.
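Per-image median scaling is the usual way a scale-ambiguous depth prediction is anchored to a known median depth; a minimal sketch (my code, not theirs):

```python
import numpy as np

# Hedged sketch: rescale a scale-ambiguous depth prediction so its
# median matches a reference median depth (e.g. one estimated offline
# with COLMAP, or ground truth at evaluation time).
def median_scale(pred_depth, ref_depth):
    """Scale predicted depth so its median matches the reference median."""
    return pred_depth * (np.median(ref_depth) / np.median(pred_depth))
```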
The code is not available so far.