Monocular estimation of metric depth has long been a challenge for applications such as autonomous driving and mobile robotics. Indoor and outdoor datasets have drastically different RGB and depth distributions, which is hard to bridge with a single model. A further obstacle is the scale ambiguity inherent in photos when the camera intrinsics are unknown. Consequently, existing monocular depth models either work only in indoor or only in outdoor settings, or estimate scale-invariant depth when trained on both.
Rather than attempting metric depth, the most common approach is to estimate depth only up to scale and shift (e.g., MiDaS), as sketched below. This standardization of depth distributions helps eliminate scale ambiguities and brings the indoor and outdoor depth distributions closer together. Training joint indoor-outdoor models that estimate metric depth has recently become popular as a way to reunify these approaches. One such model, ZoeDepth, attaches two domain-specific heads to MiDaS to handle the indoor and outdoor domains.
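To make the idea of scale-and-shift-invariant depth concrete, here is a minimal sketch of the kind of per-image normalization used in MiDaS-style training, where each image's shift (median) and scale (mean absolute deviation) are removed before computing the loss. The function name and exact statistics here are illustrative assumptions, not code from MiDaS or the paper.

```python
import numpy as np

def normalize_depth(depth, mask):
    """Scale-and-shift-invariant normalization in the style of MiDaS.

    depth: (H, W) array of depth (or inverse-depth) values.
    mask:  (H, W) boolean array marking pixels with valid ground truth.

    Removing a per-image shift (median) and scale (mean absolute deviation)
    puts indoor and outdoor images on a comparable footing, at the cost of
    discarding the absolute metric scale.
    """
    valid = depth[mask]
    shift = np.median(valid)                       # per-image shift
    scale = np.mean(np.abs(valid - shift)) + 1e-6  # per-image scale
    return (depth - shift) / scale
```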
A new study from Google Research and Google DeepMind takes this a step further and investigates denoising diffusion models for zero-shot metric depth estimation, achieving state-of-the-art performance. Field-of-view (FOV) augmentation is employed during training to improve generalization to varied camera intrinsics, and FOV conditioning is employed during both training and inference to resolve the intrinsic scale ambiguity, yielding an additional performance gain. The researchers also encode depth in the log domain, which distributes model capacity more evenly between indoor and outdoor scenes and improves indoor performance.
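The sketch below illustrates these two ideas: mapping metric depth to the log domain (so near indoor depths and far outdoor depths share the model's output range more evenly) and deriving a vertical FOV value from the intrinsics to use as conditioning. The depth bounds `D_MIN`/`D_MAX` and the function names are illustrative assumptions, not values or APIs from the paper.

```python
import numpy as np

# Assumed working range covering indoor (metres) and outdoor (tens of metres) scenes.
D_MIN, D_MAX = 0.5, 80.0

def encode_log_depth(depth_m):
    """Map metric depth to [-1, 1] in the log domain for the denoiser's target."""
    d = np.clip(depth_m, D_MIN, D_MAX)
    t = (np.log(d) - np.log(D_MIN)) / (np.log(D_MAX) - np.log(D_MIN))
    return 2.0 * t - 1.0

def decode_log_depth(code):
    """Invert the encoding back to metric depth."""
    t = (np.clip(code, -1.0, 1.0) + 1.0) / 2.0
    return np.exp(t * (np.log(D_MAX) - np.log(D_MIN)) + np.log(D_MIN))

def vertical_fov(image_height, focal_y):
    """Vertical field of view (radians) from the camera intrinsics,
    usable as a conditioning signal to resolve scale ambiguity."""
    return 2.0 * np.arctan(image_height / (2.0 * focal_y))
```

Because the log mapping is invertible, the model can be supervised in the normalized space while still producing metric depth at inference time via `decode_log_depth`.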
The research team also found that v-parameterization of the diffusion denoiser significantly speeds up inference. Fine-tuned on the same data, the proposed model, DMD (Diffusion for Metric Depth), produced substantially lower relative depth error than ZoeDepth on all eight out-of-distribution datasets. Enlarging the training set improved results further: DMD sets a new state of the art for zero-shot metric depth, with relative error 25% lower on indoor datasets and 33% lower on outdoor datasets than ZoeDepth, while remaining efficient thanks to the v-parameterized diffusion formulation.
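For readers unfamiliar with v-parameterization, the identities below show the standard formulation (Salimans & Ho, 2022) in which the network predicts a velocity-like target rather than the noise, allowing the clean sample to be recovered in a single step from the prediction. This is a generic sketch of the technique, not DMD's actual training code.

```python
import numpy as np

def v_target(x0, eps, alpha_t, sigma_t):
    """v-parameterization target: v = alpha_t * eps - sigma_t * x0,
    where the noisy sample is z_t = alpha_t * x0 + sigma_t * eps."""
    return alpha_t * eps - sigma_t * x0

def x0_from_v(z_t, v_pred, alpha_t, sigma_t):
    """Recover the clean sample from a v prediction in one step.
    With alpha_t**2 + sigma_t**2 == 1, alpha_t * z_t - sigma_t * v == x0 exactly."""
    return alpha_t * z_t - sigma_t * v_pred
```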
In conclusion, DMD is a groundbreaking diffusion model for zero-shot metric depth estimation on generic scenes. It is simple yet effective, offering an exciting glimpse at what is possible in the world of AI. We can’t wait to see how future research builds on these ideas and what will be accomplished next!