Foundation models, the very large neural networks that serve as a base for building effective machine learning systems, have become essential in natural language processing and computer vision. They also matter for Monocular Depth Estimation (MDE), the task of estimating depth from a single image, which is widely used in autonomous driving, robotics, and virtual reality. However, building datasets with dense depth labels is difficult and costly, which has limited the scale of MDE research and left existing models struggling in diverse scenarios.
To tackle this issue, researchers from The University of Hong Kong, TikTok, Zhejiang Lab, and Zhejiang University have created a foundation model for MDE capable of producing high-quality depth information from images. Because traditional depth datasets are time-consuming and costly to produce, the researchers instead focused on large-scale unlabeled images, which are cheap to collect and can be annotated automatically.
The training data combine labeled and unlabeled images, with the primary focus on the latter: 1.5 million labeled images were collected from six public datasets, while for the unlabeled images a depth engine, a teacher model trained on the labeled data, automatically generates depth annotations, forming a self-training pipeline. During joint learning, the model is challenged with a harder optimization target by applying strong perturbations to the unlabeled images, pushing it to acquire extra knowledge. The researchers also propose preserving rich semantic priors from pre-trained encoders for better scene understanding.
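To make the self-training idea concrete, here is a minimal sketch, assuming a PyTorch-style teacher/student setup; the function names, the `strong_augment` transform, and the loss are illustrative placeholders rather than the authors' actual code.

```python
import torch

# Illustrative sketch of the self-training pipeline described above (not the
# authors' actual code): a teacher trained on labeled data pseudo-labels the
# unlabeled images, and a student learns from strongly perturbed versions.

def pseudo_label(teacher, unlabeled_images):
    """Use the frozen teacher (the 'depth engine') to annotate unlabeled images."""
    teacher.eval()
    with torch.no_grad():
        return teacher(unlabeled_images)

def train_student_step(student, optimizer, labeled_batch, unlabeled_images,
                       teacher, strong_augment, depth_loss):
    images, depths = labeled_batch

    # Supervised loss on the labeled data.
    loss = depth_loss(student(images), depths)

    # Pseudo-labels come from the clean unlabeled images, but the student only
    # sees strongly perturbed versions (e.g. heavy color distortion), which is
    # the harder optimization target that forces it to learn extra knowledge.
    targets = pseudo_label(teacher, unlabeled_images)
    perturbed = strong_augment(unlabeled_images)
    loss = loss + depth_loss(student(perturbed), targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point the sketch tries to capture is that the annotations are produced on clean images while the student is trained on perturbed ones, so learning from the pseudo-labels is deliberately made harder than simply copying the teacher.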
In the evaluation, the model, named "Depth Anything", was compared against MiDaS v3.1 on zero-shot depth estimation and outperformed it across a wide variety of scenes. Depth Anything also surpassed ZoeDepth, a model built on MiDaS, in metric depth estimation. Finally, when its pre-trained encoder was transferred to semantic segmentation, it performed strongly there as well, suggesting the model can serve as a multi-task backbone for both MDE and semantic segmentation.
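For readers who want to try zero-shot depth estimation themselves, the sketch below shows one plausible way to run a released Depth Anything checkpoint through the Hugging Face transformers depth-estimation pipeline; the model identifier is an assumption and may not match the exact name on the hub.

```python
from PIL import Image
from transformers import pipeline

# Hypothetical usage sketch: the model id below is an assumption and may
# differ from the checkpoint actually published by the authors.
depth_estimator = pipeline(
    task="depth-estimation",
    model="LiheYoung/depth-anything-large-hf",  # assumed model id
)

image = Image.open("example.jpg")  # any RGB photo
result = depth_estimator(image)

# The pipeline returns the raw predicted depth tensor plus a rendered
# depth map as a PIL image, which can be saved directly.
result["depth"].save("example_depth.png")
```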
In conclusion, Depth Anything offers a robust solution to MDE built on cheap and diverse unlabeled images. By challenging the model with a harder optimization target when learning from unlabeled data and preserving rich semantic priors from pre-trained models, the researchers achieved superior performance and strong zero-shot estimation capabilities. Surpassing the latest MiDaS model underscores Depth Anything's potential for future depth estimation tasks.
A research paper and a GitHub repository are available for further details on the work; all credit for this research goes to the project's researchers.