Stereo matching, a fundamental problem in computer vision for nearly fifty years, is the computation of disparity maps from a pair of rectified images. It is critical to fields including autonomous driving, robotics and augmented reality. Existing surveys categorise end-to-end architectures into 2D and 3D classes according to how they compute and optimise the cost volume. These surveys highlight unresolved issues, but they also show how rapidly the field has changed and grown, especially since the adoption of deep learning.
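To make the terms "disparity map" and "cost volume" concrete, here is a minimal sketch of the classical (non-deep) pipeline on a tiny synthetic pair. It is illustrative only and is not taken from the surveys: the function names, the absolute-difference matching cost and the camera parameters are all assumptions chosen for brevity. For rectified images, a pixel at column x in the left image is assumed to appear at column x - d in the right image, and depth follows from Z = f * B / d.

```python
# Illustrative sketch only: hypothetical helper names, pure Python, no learning.

def cost_volume(left, right, max_disp):
    """Build C[d][y][x] = |L(y, x) - R(y, x - d)| for d in 0..max_disp.

    Positions where x - d falls outside the right image get infinite cost.
    """
    height, width = len(left), len(left[0])
    volume = []
    for d in range(max_disp + 1):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                if x - d >= 0:
                    row.append(abs(left[y][x] - right[y][x - d]))
                else:
                    row.append(float("inf"))
            plane.append(row)
        volume.append(plane)
    return volume


def winner_takes_all(volume):
    """Pick, per pixel, the disparity with the lowest matching cost."""
    height, width = len(volume[0]), len(volume[0][0])
    disparity = []
    for y in range(height):
        row = []
        for x in range(width):
            costs = [volume[d][y][x] for d in range(len(volume))]
            row.append(costs.index(min(costs)))
        disparity.append(row)
    return disparity


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Triangulate depth Z = f * B / d; zero disparity maps to infinity."""
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Synthetic one-row pair: the left row is the right row shifted by 2 pixels.
right_img = [[10, 20, 30, 40, 50, 60]]
left_img = [[0, 0, 10, 20, 30, 40]]  # first two columns are filler values

disp_map = winner_takes_all(cost_volume(left_img, right_img, max_disp=3))
# Assumed camera: 700 px focal length, 0.5 m baseline (made-up values).
depth_m = disparity_to_depth(disp_map[0][2], focal_px=700.0, baseline_m=0.5)
```

Deep stereo networks replace the hand-crafted absolute difference with learned features and regularise the volume with 2D or 3D convolutions, which is the distinction the surveys use for their taxonomy; the data structure itself is the same idea as above.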
Despite these accomplishments, deep stereo matching still faces several challenges. One prominent issue is poor generalisation, particularly under the domain shift between synthetic and real data. Surveys from the late 2010s covered the initial phase of this revolution, but the field has advanced considerably since then.
A new study by a team from the University of Bologna offers a detailed analysis of recent developments in deep stereo matching. It focuses on the paradigm shifts that have reshaped the field in the 2020s, such as transformer-based architectures and novel designs like RAFT-Stereo. The team also assess the challenges that have emerged alongside these advances, categorise them, and detail the most effective methods for addressing them.
Key findings from the study show that the architectural design of RAFT-Stereo has significantly increased robustness to domain shift. Given its success, the researchers expect further frameworks to adopt this paradigm. The search for novel and efficient designs also continues to attract considerable interest. In addition, the use of thermal, multispectral or event-camera images as input to stereo-matching networks is gaining traction, opening new directions for a field that continues to evolve.
Despite the considerable progress made, some core issues remain. High-resolution images are still challenging to process, as demonstrated by the Booster dataset. Non-Lambertian objects, such as transparent or reflective surfaces, are also problematic because of a lack of training data or effective methods. Adverse weather conditions remain a challenge as well, especially in autonomous driving scenarios.
In conclusion, the authors argue that, although vision foundation models have been developed for other computer vision tasks, stereo matching remains necessary, and they anticipate further effort on stereo along the lines already seen in single-image depth estimation. By identifying both current challenges and effective solutions, the analysis offers directions for future research that can benefit both newcomers and established practitioners in deep stereo matching.