Vision-language models (VLMs) like GPT-4V have made significant progress across a wide range of AI tasks, but they still struggle with spatial reasoning. This limitation is especially problematic in fields such as robotics and augmented reality, which require a detailed understanding of the positions and relationships of objects in three-dimensional space.
A team of researchers from Google DeepMind and Google Research found that this limitation doesn’t stem from the architecture of the VLMs themselves; rather, it results from a lack of 3D spatial data in their training datasets. To address this, they developed SpatialVLM, a new system designed specifically to enhance the spatial reasoning capabilities of VLMs.
The team created a unique, large-scale spatial reasoning dataset and trained SpatialVLM on it. The dataset consists of detailed 3D spatial annotations extracted from two-dimensional images, which the researchers obtained through a combination of open-vocabulary detection, metric depth estimation, semantic segmentation, and object-centric captioning models.
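To make the lifting step concrete, here is a minimal sketch of how 2D detections and a metric depth map might be back-projected into 3D to generate one quantitative question-answer pair. The detector, depth estimator, and camera intrinsics below are illustrative placeholders, not the actual models or parameters used in the paper.

```python
# Sketch of the 2D-to-3D annotation idea: lift detected object centers
# into metric 3D using a depth map and pinhole camera intrinsics, then
# emit a quantitative spatial QA pair. The model calls are hypothetical
# stand-ins for the paper's detection and depth-estimation components.
import numpy as np

def detect_objects(image):
    # Placeholder for an open-vocabulary detector + object-centric
    # captioner. Returns (label, pixel center (u, v)) pairs.
    return [("mug", (320, 240)), ("laptop", (480, 200))]

def estimate_depth(image):
    # Placeholder for a metric depth estimator: depth in meters per pixel.
    return np.full((480, 640), 1.5)

def pixel_to_3d(u, v, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    # Back-project a pixel into camera coordinates; the intrinsics here
    # are illustrative values, not calibrated ones.
    z = depth[v, u]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def make_qa_pair(image):
    (name_a, (ua, va)), (name_b, (ub, vb)) = detect_objects(image)[:2]
    depth = estimate_depth(image)
    dist = np.linalg.norm(pixel_to_3d(ua, va, depth) - pixel_to_3d(ub, vb, depth))
    return (f"How far is the {name_a} from the {name_b}?",
            f"About {dist:.2f} meters.")

print(make_qa_pair(image=None))
```

Generating question-answer pairs from these lifted 3D positions is what lets the pipeline produce spatial training data at scale from ordinary 2D images.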
In their tests, SpatialVLM consistently outperformed other vision-language models on spatial reasoning tasks, handling both qualitative and quantitative spatial queries effectively even when the training data was noisy. This makes it a valuable tool for sophisticated robotics tasks, particularly reward annotation.
When integrated with a large language model, SpatialVLM can perform spatial chain-of-thought reasoning, meaning it can decompose and solve multi-step spatial problems. This capability is extremely beneficial in fields that need advanced spatial analysis.
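One plausible orchestration of this idea is sketched below: an LLM breaks a multi-step spatial question into single-hop queries that the spatial VLM answers, then synthesizes a conclusion from the answers. All three model calls are hypothetical stubs, not the paper’s actual interfaces.

```python
# Illustrative sketch of spatial chain-of-thought reasoning: the LLM
# decomposes the question, the spatial VLM answers each sub-question,
# and the LLM combines the facts into a final answer.

def vlm_query(image, question: str) -> str:
    # Stand-in for a SpatialVLM call answering one direct spatial query.
    canned = {
        "How wide is the table?": "0.9 meters",
        "How wide is the couch?": "1.8 meters",
    }
    return canned.get(question, "unknown")

def llm_decompose(question: str) -> list[str]:
    # Stand-in for an LLM breaking a comparison into sub-questions.
    return ["How wide is the table?", "How wide is the couch?"]

def llm_conclude(question: str, facts: list[str]) -> str:
    # Stand-in for an LLM synthesizing the final answer from the facts.
    return f"Given {facts}, the couch is wider, so the table could fit in front of it."

def spatial_cot(image, question: str) -> str:
    facts = [f"{q} -> {vlm_query(image, q)}" for q in llm_decompose(question)]
    return llm_conclude(question, facts)

print(spatial_cot(None, "Would the table fit in front of the couch?"))
```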
The team has already started exploring SpatialVLM’s use in a range of novel downstream applications in spatial reasoning and robotics, such as serving as a dense reward annotator and a success detector, as sketched below.
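As an illustration of the dense-reward idea, the sketch below queries a spatial VLM for the gripper-to-target distance at each timestep of a trajectory and uses its negative as a shaped reward; `vlm_distance` is a hypothetical stand-in for such a query, not the paper’s API. A success detector could be built the same way by thresholding the estimated distance.

```python
# Sketch of a spatial VLM as a dense reward annotator for robot learning:
# at each timestep, ask the model how far the gripper is from the target
# and use the negative distance as a shaped reward signal.

def vlm_distance(frame, obj_a: str, obj_b: str) -> float:
    # Stand-in for asking the VLM "How far is obj_a from obj_b?" (meters).
    return frame["gripper_to_target_m"]

def annotate_rewards(trajectory, target: str = "red block"):
    # Dense reward: negative estimated distance, so progress toward the
    # target monotonically increases the reward.
    return [-vlm_distance(frame, "gripper", target) for frame in trajectory]

# Toy trajectory: the gripper closes in on the target over four frames.
episode = [{"gripper_to_target_m": d} for d in (0.40, 0.25, 0.10, 0.02)]
print(annotate_rewards(episode))  # [-0.4, -0.25, -0.1, -0.02]
```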
The development of SpatialVLM represents a significant advancement in AI. It brings more accurate spatial reasoning to VLMs, superior performance on related tasks, and the ability to perform complex spatial chain-of-thought reasoning. It is likely to be especially beneficial in industries, such as robotics, that rely on sophisticated spatial analysis.
More information on SpatialVLM can be found in the original project paper. Credit for this research belongs entirely to the team at Google DeepMind and Google Research.