
Researchers at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have shown that language models trained without any image data still capture meaningful knowledge of the visual world. The team found that, despite never having seen an image, the models could write image-rendering code that generates detailed and complex scenes. This knowledge comes from the vast amount of text the models are trained on, including code from across the internet.
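To make the idea concrete, here is a minimal sketch (standard library only; the scene and the SVG output format are illustrative assumptions, not the paper's exact setup) of the kind of image-rendering program a text-only language model can emit: code that describes a visual scene entirely in text, which a renderer then turns into pixels.

```python
# Illustrative sketch: a "drawing" expressed purely as code/markup.
# The specific scene (sky, ground, sun, house) is a made-up example.

def render_scene_svg(width=256, height=256):
    """Return an SVG string depicting a simple scene: sky, ground, sun, house."""
    shapes = [
        # sky (background)
        f'<rect x="0" y="0" width="{width}" height="{height}" fill="skyblue"/>',
        # ground (lower third)
        f'<rect x="0" y="{height * 2 // 3}" width="{width}" height="{height // 3}" fill="forestgreen"/>',
        # sun
        '<circle cx="45" cy="45" r="25" fill="gold"/>',
        # house body
        '<rect x="90" y="120" width="90" height="60" fill="sienna"/>',
        # roof
        '<polygon points="80,120 135,80 190,120" fill="darkred"/>',
    ]
    body = "\n  ".join(shapes)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">\n'
            f'  {body}\n</svg>')

scene = render_scene_svg()
```

Because the scene lives entirely in text, a model that has only ever read code like this can still learn how shapes, positions, and colors compose into a picture.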

To probe this visual knowledge, the CSAIL team built their own “Visual Aptitude Dataset” and used it to assess the models’ abilities to draw, recognize, and self-correct images. They had the language models write image-rendering code for specific images, progressively refining the generated results. The researchers then used these generated images to train a computer vision system to identify the contents of real photos. Interestingly, the system performed surprisingly well, even outperforming systems trained on some datasets of authentic images.
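The draw-and-self-correct loop described above can be sketched as follows. This is a hypothetical skeleton, not the CSAIL team's actual pipeline: `ask_model` and `render` are stand-in stubs for a language-model call and a code-execution renderer.

```python
# Hypothetical sketch of an iterative "draw, render, refine" loop.
# ask_model() and render() are placeholder stubs, assumed for illustration.

def ask_model(prompt):
    """Stub for a language-model call; returns rendering code as text."""
    return f"# code for: {prompt}"

def render(code):
    """Stub renderer; the real pipeline would execute the drawing code."""
    return f"image<{hash(code) % 1000}>"

def draw_with_refinement(concept, rounds=3):
    """Ask for rendering code, render it, then repeatedly ask for improvements."""
    code = ask_model(f"Write code that draws: {concept}")
    history = []
    for _ in range(rounds):
        image = render(code)
        history.append((code, image))
        # Feed the model its own code back and request a refinement.
        code = ask_model(f"Improve this code so the drawing of '{concept}' "
                         f"looks better. Current code:\n{code}")
    return history

trace = draw_with_refinement("a strawberry")
```

Each round's (code, image) pair could then be collected into a synthetic training set for the downstream vision system.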

The CSAIL team explained that one major difficulty was the inconsistency between images rendered from language-model code and those produced by a diffusion model. They suggested that language models could be used to create sketches that guide diffusion models toward more effective results. However, the language models sometimes struggled to recognize the very images they were capable of drawing: they failed to correctly identify human recreations of images the models themselves had previously drawn.

In addition to these experiments, the team explored whether language models could draw the same concept differently each time. They found that when asked to draw concepts such as strawberries and arcades repeatedly, the models consistently produced images with varying perspectives and shapes, hinting that the models may possess a kind of internal imagination of visual concepts.

Ultimately, the CSAIL team believes this work lays the groundwork for evaluating how well generative AI models can train computer vision systems. They plan to conduct additional research that delves further into the source of this visual knowledge by examining the language models’ original training data.
