Large language models (LLMs) are central to natural language processing (NLP), but they generally perform poorly on tasks that require visual and spatial reasoning. Researchers from Columbia University have proposed a new approach to this problem: Whiteboard-of-Thought (WoT) prompting, which aims to enhance the visual reasoning abilities of multimodal large language models (MLLMs) across modalities.
WoT prompting provides a figurative ‘whiteboard’ on which MLLMs draw out their reasoning steps as images, which are then returned to the model for further processing. The technique leverages the model’s existing ability to write code with libraries such as Matplotlib and Turtle, without requiring in-context examples or specialized modules. It has shown impressive results on four challenging natural language tasks that require visual and spatial reasoning.
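To make the loop concrete, here is a minimal sketch of how such a pipeline could be wired up. It is not the authors’ implementation: `query_mllm` is a hypothetical wrapper around any multimodal LLM API (text plus optional image input), and the prompts and the convention that the generated code stores its figure in `fig` are assumptions for illustration.

```python
# Minimal sketch of a Whiteboard-of-Thought loop (illustrative, not the paper's code).
import io
import contextlib

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt


def query_mllm(prompt: str, image_bytes: bytes | None = None) -> str:
    """Hypothetical call to a multimodal LLM; replace with a real API client."""
    raise NotImplementedError


def whiteboard_of_thought(question: str) -> str:
    # Step 1: ask the model to draw its reasoning as Matplotlib code.
    draw_prompt = (
        "Write Python code using matplotlib that draws the situation described "
        f"below on a blank 'whiteboard'. Assign the figure to a variable `fig`.\n\n{question}"
    )
    code = query_mllm(draw_prompt)

    # Step 2: execute the generated code to render the whiteboard image.
    # (A real system would sandbox this execution.)
    namespace: dict = {"plt": plt}
    with contextlib.redirect_stdout(io.StringIO()):
        exec(code, namespace)
    buf = io.BytesIO()
    namespace["fig"].savefig(buf, format="png")

    # Step 3: return the rendered image to the model and ask the question again.
    answer_prompt = f"Using the attached whiteboard drawing, answer: {question}"
    return query_mllm(answer_prompt, image_bytes=buf.getvalue())
```

The key design point is that the same model both writes the drawing code and later consumes the rendered image, so no separate drawing module or fine-tuning is needed.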
The goal of WoT is to equip MLLMs with the ability to generate and visually process images in order to better answer queries. While current MLLMs typically lack any built-in way to produce visual outputs, the researchers demonstrated how visuals can still be created by a model designed only to generate text. The visuals produced in this process are minimalistic, abstract, and symbolic.
The study also found that chain-of-thought (CoT) prompting, a popular method in NLP, failed dramatically on some of these tasks, in certain cases achieving 0% accuracy. In contrast, WoT reached up to 92% accuracy on the same tasks.
The research further revealed that text-based reasoning performs best in a 2D grid setting but can degrade sharply in other geometric layouts. This is likely because a grid, especially a simple square one, is easy to represent as coordinates in text, and such representations are widely available online.
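As a toy illustration of that point (not taken from the paper), a square grid collapses naturally into explicit (row, column) coordinates in text, whereas irregular shapes have no comparably common textual encoding:

```python
# A small maze-like grid: 'S' = start, 'G' = goal, '#' = wall, '.' = open cell.
grid = [
    "S.#",
    ".#.",
    "..G",
]

# Each cell becomes an explicit (row, col) coordinate, a representation
# that appears frequently in web text and is easy for a text-only model to manipulate.
as_coordinates = {
    (r, c): cell
    for r, row in enumerate(grid)
    for c, cell in enumerate(row)
}

print(as_coordinates[(0, 0)])  # 'S' -- start cell
print(as_coordinates[(2, 2)])  # 'G' -- goal cell
```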
WoT, by contrast, performs consistently across different geometries, removing the reliance on 2D-grid-specific textual knowledge and demonstrating the general applicability of the approach. These findings also highlight interesting differences in spatial understanding between humans and LLMs.
In conclusion, the WoT method enables visual reasoning across modalities in MLLMs by generating code that renders a visual, which is then fed back to the model for further reasoning. While this represents a significant advance on tasks requiring visual and spatial comprehension, WoT also depends on accurate vision capabilities. Future research should therefore focus on improving MLLMs’ understanding of detailed geometric figures.