Large Language Models (LLMs) excel at language understanding and reasoning, yet they still fall short in spatial reasoning, an area where human cognition shines. Humans possess powerful mental imagery, often called the Mind’s Eye, which lets them imagine the unseen world, a capability largely unexplored in LLMs. This gap limits the models’ grasp of spatial concepts and their ability to replicate human-like imagination.
Previous research highlights LLMs’ strong performance on language tasks alongside their under-investigated spatial reasoning abilities. Human cognition relies heavily on spatial reasoning to interact with the environment, whereas LLMs depend predominantly on verbal reasoning. Humans strengthen spatial awareness through mental imagery, which supports navigation and mental simulation, a concept studied extensively in neuroscience, philosophy, and cognitive science.
Microsoft researchers have introduced Visualization-of-Thought (VoT) prompting, designed to elicit and manipulate mental images, paralleling the human mind’s eye, for spatial reasoning. With VoT prompting, LLMs maintain a visuospatial sketchpad to visualize their reasoning steps, which in turn guides subsequent spatial reasoning. The technique is zero-shot, exploiting LLMs’ ability to generate mental images from text-based visual art rather than relying on few-shot demonstrations or text-to-image visualization with CLIP.
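To make the idea concrete, here is a minimal sketch of what a zero-shot VoT-style prompt could look like in Python. The instruction wording and the `build_vot_prompt` helper are illustrative assumptions, not the exact prompt used by the researchers.

```python
# Hypothetical zero-shot VoT-style instruction: ask the model to visualize
# the state after each reasoning step. The wording is an assumption for
# illustration, not quoted from the paper.
VOT_INSTRUCTION = (
    "Solve the task step by step. After each reasoning step, visualize the "
    "current state of the grid as text before continuing."
)

def build_vot_prompt(task_description: str) -> str:
    """Combine a task description with the zero-shot VoT instruction."""
    return f"{task_description}\n\n{VOT_INSTRUCTION}"

if __name__ == "__main__":
    task = (
        "You are on a 3x3 grid at cell (0, 0). "
        "Follow the moves: right, right, down. Which cell do you end up in?"
    )
    print(build_vot_prompt(task))
```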
VoT prompts LLMs to generate a visualization after each reasoning step, forming a series of interleaved reasoning and visualization traces. Tracking the visual state through a visuospatial sketchpad, which records the partial solution at every stage, grounds the model’s reasoning in visual context and sharpens the spatial reasoning skills needed for tasks such as navigation and tiling.
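The sketch below illustrates the idea of a text-based visuospatial sketchpad: after each move in a toy navigation task, the partial solution is rendered as an ASCII grid, mimicking the interleaved reasoning and visualization traces VoT elicits. The grid size, symbols, and move set are assumptions for illustration, not the paper’s exact format.

```python
# Toy visuospatial sketchpad: render the partial solution as an ASCII grid
# after every reasoning (navigation) step.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def render(grid_size: int, pos: tuple[int, int]) -> str:
    """Draw the grid, marking the agent's current cell with '*'."""
    return "\n".join(
        " ".join("*" if (r, c) == pos else "." for c in range(grid_size))
        for r in range(grid_size)
    )

def trace(grid_size: int, start: tuple[int, int], moves: list[str]) -> None:
    """Print an interleaved trace: one visualization after each step."""
    pos = start
    print("Initial state:\n" + render(grid_size, pos))
    for step, move in enumerate(moves, 1):
        dr, dc = MOVES[move]
        pos = (pos[0] + dr, pos[1] + dc)
        print(f"\nStep {step}: move {move}\n" + render(grid_size, pos))

trace(3, (0, 0), ["right", "right", "down"])
```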
Performance comparisons confirm that GPT-4 with VoT outperforms the other settings across all tasks and metrics, underscoring the impact of visual state tracking. In the natural language navigation task, GPT-4 with VoT beats GPT-4 without VoT by a margin of 27%. Interestingly, GPT-4 CoT surpasses GPT-4V CoT in the visual tasks, suggesting the benefit of grounding LLMs with a 2D grid represented in text for spatial reasoning.
The landmark research offers several key contributions:
1. It delves into LLMs’ mental imagery for spatial reasoning, studying its distinctive traits and limitations and tracing its likely origins to code pre-training.
2. It introduces two novel tasks, “visual navigation” and “visual tiling”, backed by synthetic datasets. These provide diverse sensory inputs for LLMs at varying levels of complexity, offering an excellent testbed for studying spatial reasoning (a toy instance generator is sketched after this list).
3. The research introduces VoT prompting, which elicits LLMs’ mental imagery for spatial reasoning and outperforms other prompting techniques as well as existing Multimodal Large Language Models (MLLMs). Because this capability mirrors the human mind’s eye, it could also be applied to enhance MLLMs.
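As referenced in the second contribution, here is a toy generator in the spirit of the synthetic visual navigation datasets: it samples a legal random route on a square grid and returns the start cell, the move sequence, and the end cell. The grid size, move vocabulary, and output format are assumptions; the actual dataset construction in the paper may differ.

```python
# Toy synthetic visual-navigation instance generator (illustrative only).
import random

def generate_navigation_instance(grid_size: int = 4, num_moves: int = 5, seed: int = 0):
    """Return a start cell, a legal move sequence, and the resulting end cell."""
    rng = random.Random(seed)
    moves_map = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    pos = (rng.randrange(grid_size), rng.randrange(grid_size))
    start, route = pos, []
    for _ in range(num_moves):
        # Only keep moves that stay inside the grid.
        legal = [m for m, (dr, dc) in moves_map.items()
                 if 0 <= pos[0] + dr < grid_size and 0 <= pos[1] + dc < grid_size]
        move = rng.choice(legal)
        dr, dc = moves_map[move]
        pos = (pos[0] + dr, pos[1] + dc)
        route.append(move)
    return start, route, pos

print(generate_navigation_instance())
```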
In conclusion, this research unveils VoT, which simulates the human cognitive ability to visualize mental images. VoT enables LLMs to excel at multi-hop spatial reasoning, outperforming MLLMs on visual tasks. Given its ability to emulate the mind’s eye, VoT also holds promise for MLLMs. The results underline VoT’s effectiveness in strengthening LLMs’ spatial reasoning and point to its potential to advance multimodal language models.