Autoregressive (AR) large language models (LLMs), such as the GPT series, have made significant strides toward artificial general intelligence (AGI). These models use self-supervised learning to predict the next token in a sequence, which lets them learn from vast amounts of unlabeled data and then adapt to a diverse range of unseen tasks through zero-shot and few-shot prompting.
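To make the mechanism concrete, here is a minimal sketch of the next-token objective in PyTorch. The tiny model and random token stream are illustrative placeholders, not the GPT training setup itself.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 128, 64, 8

class TinyARModel(nn.Module):
    """Stand-in autoregressive model: embeds tokens and predicts the next one."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits for the next token at each position

model = TinyARModel()
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # unlabeled token stream

# Self-supervised next-token prediction: inputs are tokens[:-1], targets tokens[1:].
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()
```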
In recent years, the computer vision community has sought to replicate the scalability and generalizability of language models with large autoregressive or world models. Models such as VQGAN and DALL-E have shown promising results in image generation: a visual tokenizer discretizes continuous images into 2D grids of tokens, which are then flattened into a 1D sequence for AR learning. However, the scaling laws of these models remain underexplored, and their performance currently lags behind that of diffusion models.
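The tokenize-then-flatten pipeline can be sketched as follows. Here `toy_tokenize` is a hypothetical stand-in for a learned tokenizer such as VQGAN's encoder plus vector quantization; only the 2D-grid-to-1D-sequence flattening step mirrors the real pipeline.

```python
import torch

def toy_tokenize(image: torch.Tensor, grid: int = 16, codebook_size: int = 8192):
    """Map a (3, H, W) image to a (grid, grid) map of discrete token indices.
    Real tokenizers use a learned encoder + vector quantization; this is a toy."""
    patches = torch.nn.functional.adaptive_avg_pool2d(image, grid)  # (3, grid, grid)
    # Fake quantization: hash pooled values into codebook indices.
    return (patches.sum(dim=0) * 1000).long().abs() % codebook_size

image = torch.rand(3, 256, 256)
token_map = toy_tokenize(image)   # 2D token map, shape (16, 16)
sequence = token_map.flatten()    # 1D raster-order sequence, length 256
# `sequence` is then modeled token by token, exactly like text.
```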
To overcome these challenges, researchers at Peking University have developed a new approach to autoregressive learning for images, known as Visual AutoRegressive (VAR) modeling. VAR is inspired by the hierarchical nature of human perception and the principles of multi-scale design: it encodes an image into multi-scale token maps and replaces next-token prediction with next-scale prediction, starting from a low-resolution token map and progressively predicting higher-resolution ones. This shift dramatically improves AR baseline performance, particularly on the ImageNet 256×256 benchmark.
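A hedged sketch of this coarse-to-fine loop is shown below. The scale schedule and `predict_next_scale` function are illustrative placeholders for the paper's multi-scale tokenizer and transformer; the point is that each autoregressive step emits an entire token map at the next resolution, conditioned on all coarser maps generated so far.

```python
import torch

scales = [1, 2, 4, 8, 16]  # side lengths of the token maps, coarse to fine
codebook_size = 4096

def predict_next_scale(prev_maps, side):
    """Placeholder for the VAR transformer: returns a (side, side) token map
    conditioned on all previously generated (coarser) maps."""
    return torch.randint(0, codebook_size, (side, side))

generated = []
for side in scales:
    generated.append(predict_next_scale(generated, side))
# `generated[-1]` is the finest 16x16 token map, which a multi-scale
# VQ decoder (not shown) maps back to pixels.
```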
The researchers’ empirical testing of VAR models has revealed scaling laws similar to those observed in LLMs, suggesting significant potential for further scaling and application to a variety of tasks. Notably, VAR models demonstrate zero-shot generalization to downstream tasks such as image in-painting, out-painting, and editing. This accomplishment not only represents a significant improvement in the performance of visual autoregressive models but also marks the first time that GPT-style AR methods have surpassed strong diffusion models in image synthesis.
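Operationally, a scaling law of this kind means test loss falls as a power law in model size, so it appears as a straight line in log-log space. The sketch below fits such a law to made-up numbers purely for illustration; the actual exponents come from the paper's own measurements.

```python
import numpy as np

params = np.array([1e7, 1e8, 1e9, 1e10])  # hypothetical model sizes (parameters)
loss = 5.0 * params ** -0.1                # hypothetical power law: L = c * N^alpha

# Fitting a line in log-log space recovers the power-law exponent.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
print(f"fitted exponent alpha ≈ {slope:.3f}")  # ≈ -0.1
```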
In conclusion, the researchers have contributed a new visual generative framework built on a multi-scale AR paradigm, provided empirical validation of its scaling laws and zero-shot generalization potential, and significantly advanced the performance of visual AR models. They have also released an extensive open-source code suite. Together, these efforts aim to drive the future development of visual AR learning and to bridge the gap between language models and computer vision, opening new possibilities in AI research and application.