3D Packing for Self-Supervised Monocular Depth Estimation

by Vitor Guizilini et al., arXiv, 2020


  1. Depth estimator: fD : I → D
  2. Ego-motion estimator: fx : (It, IS) → x_{t→S}
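A minimal interface sketch of the two networks in PyTorch. The tiny architectures below are placeholders for illustration only, not PackNet or the paper's actual pose CNN:

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """Stand-in for fD: I -> D. The real model uses PackNet's
    packing/unpacking blocks; this is just the input/output contract."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),  # inverse depth in (0, 1)
        )

    def forward(self, image):           # image: (B, 3, H, W)
        return self.net(image)          # inverse depth: (B, 1, H, W)

class PoseNet(nn.Module):
    """Stand-in for fx: (I_t, I_S) -> x_{t->S}, a 6-DoF relative pose
    (3 translation + 3 rotation parameters)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 6),           # translation + axis-angle rotation
        )

    def forward(self, target, source):
        # Pose networks typically take the image pair stacked on channels.
        return self.net(torch.cat([target, source], dim=1))  # (B, 6)
```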

Depth Estimator

They predict inverse depth and use the PackNet architecture.

Inverse depth probably gives more stable results: points far from the camera have small inverse depth values and are represented with low precision, while nearer points carry more information.
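The usual way to realize this (common in self-supervised depth work such as monodepth2; the exact bounds here are an assumption, not from the paper) is to map a sigmoid network output to depth through inverse depth:

```python
import torch

def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Convert a sigmoid output in (0, 1) to metric depth via inverse depth.

    Distant points map to small inverse depth (coarse resolution), near
    points to large inverse depth (fine resolution). The min/max depth
    bounds are illustrative, not taken from the paper.
    """
    min_inv = 1.0 / max_depth            # inverse depth of the farthest point
    max_inv = 1.0 / min_depth            # inverse depth of the nearest point
    inv_depth = min_inv + (max_inv - min_inv) * disp
    return 1.0 / inv_depth

disp = torch.tensor([0.0, 0.5, 1.0])
depth = disp_to_depth(disp)              # far -> 100.0, near -> 0.1
```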

They probably assume the scene is rigid and there are no moving objects, which is likely to cause errors on moving objects. How do they deal with moving objects?

Ego motion estimator

They use a rather simple CNN from SfMLearner.

Loss function

The loss function consists of three parts:

  1. appearance loss
  2. depth smoothness loss
  3. velocity scaling loss
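A minimal sketch of how these three terms might combine. The weights, the plain-L1 photometric term (the paper's appearance loss also uses SSIM), and the velocity-loss form below are common self-supervised-depth choices, not necessarily the paper's exact formulation:

```python
import torch

def appearance_loss(target, warped):
    """Photometric L1 loss between the target image and the source image
    warped into the target view (a fuller version adds an SSIM term)."""
    return (target - warped).abs().mean()

def smoothness_loss(inv_depth, image):
    """Edge-aware smoothness: penalize inverse-depth gradients, downweighted
    where the image itself has strong gradients (i.e. at object edges)."""
    d_dx = (inv_depth[..., :, 1:] - inv_depth[..., :, :-1]).abs()
    d_dy = (inv_depth[..., 1:, :] - inv_depth[..., :-1, :]).abs()
    i_dx = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

def velocity_loss(translation, gt_speed, dt):
    """Scale supervision: the norm of the predicted translation should match
    the distance travelled between frames (speed * frame interval)."""
    return (translation.norm(dim=-1) - gt_speed * dt).abs().mean()

def total_loss(target, warped, inv_depth, translation, speed, dt,
               w_smooth=1e-3, w_vel=0.05):   # weights are illustrative
    return (appearance_loss(target, warped)
            + w_smooth * smoothness_loss(inv_depth, target)
            + w_vel * velocity_loss(translation, speed, dt))
```

The velocity term is what lets the model recover metric scale from the car's measured speed, which pure monocular self-supervision cannot do on its own.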


paper: Unsupervised Learning of Depth and Ego-Motion from Video, by Berkeley and Google



One of the previous works that became a foundation for 3D Packing.

Official website.

Github: https://github.com/tinghuiz/SfMLearner.

Implementation in pytorch: SfmLearner-Pytorch

Depth from Videos in the Wild

Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras by Ariel Gordon et al, 2019.




Here they learn not only depth and R/t but also the camera intrinsics.
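A sketch of how predicting intrinsics might look. The head below (softplus focal lengths, sigmoid principal point) is an illustrative assumption based on the paper's idea, not its actual architecture:

```python
import torch
import torch.nn as nn

class IntrinsicsHead(nn.Module):
    """Predict a pinhole camera matrix K from a bottleneck feature vector.

    Focal lengths go through softplus so they stay positive; the principal
    point is predicted as a fraction of the image size so it stays inside
    the image. All details here are a sketch, not the paper's exact code.
    """
    def __init__(self, feat_dim=128):
        super().__init__()
        self.focal = nn.Linear(feat_dim, 2)    # fx, fy (before softplus)
        self.center = nn.Linear(feat_dim, 2)   # cx, cy (before sigmoid)

    def forward(self, feat, width, height):
        scale = feat.new_tensor([width, height])
        f = nn.functional.softplus(self.focal(feat)) * scale   # positive focals
        c = torch.sigmoid(self.center(feat)) * scale           # inside the image
        K = torch.zeros(feat.shape[0], 3, 3)
        K[:, 0, 0], K[:, 1, 1] = f[:, 0], f[:, 1]
        K[:, 0, 2], K[:, 1, 2] = c[:, 0], c[:, 1]
        K[:, 2, 2] = 1.0
        return K
```

Because K enters the view-synthesis warp differentiably, the photometric loss alone can supervise it, so no calibration is needed at training time.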


Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos

Paper by Google.



Based on vid2depth.


Another paper by Google.


Based on SfMLearner.

monodepth2 - Digging Into Self-Supervised Monocular Depth Estimation



A method that 3D Packing uses as a competitor.

RealMonoDepth: Self-Supervised Monocular Depth Estimation for General Scenes

Self-supervised training from both stereo and monocular data.

by deepai: https://deepai.org/publication/realmonodepth-self-supervised-monocular-depth-estimation-for-general-scenes.

They claim to be better than monodepth2 and more generalized than "Depth from Videos in the Wild". However, they require camera calibration and a median depth estimate prior to processing, obtained with an external tool (COLMAP). Not really "in the wild".

They were able to train on multiple scenes with different depth ranges. The method still requires static scenes for training; for example, they use data from the Mannequin Challenge to train their models. The trained network can then be used on dynamic scenes with people.

The code is not yet available.