This paper introduces VisionLLaMA, a transformer architecture for vision that follows the design of the LLaMA family of models and aims to bridge the gap between language and vision modalities. Following the Vision Transformer (ViT) pipeline, it segments an image into non-overlapping patches and processes them through VisionLLaMA blocks, which combine self-attention with Rotary Positional Embeddings (RoPE) and a SwiGLU feed-forward layer. Unlike ViT, VisionLLaMA relies solely on the positional encoding built into its basic block.
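To make the block structure concrete, below is a minimal PyTorch sketch of a VisionLLaMA-style block under stated assumptions, not the authors' implementation: the class name `VisionLLaMABlock`, the hyperparameters (384-dim tokens, 6 heads, 16x16 patches), the LLaMA-style RMSNorm, and the 1D RoPE over the flattened patch sequence (rather than a 2D formulation over patch coordinates) are all illustrative choices.

```python
# Minimal sketch of a VisionLLaMA-style block (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_rotate(x, base=10000.0):
    """Apply 1D rotary position embeddings to a (B, heads, N, head_dim) tensor."""
    b, h, n, d = x.shape
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    angles = torch.arange(n, device=x.device).float()[:, None] * freqs[None, :]  # (N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    """SwiGLU feed-forward layer: project, gate with SiLU, project back."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class VisionLLaMABlock(nn.Module):
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        # LLaMA-style RMSNorm (PyTorch >= 2.4); the paper ablates the normalization choice.
        self.norm1 = nn.RMSNorm(dim)
        self.norm2 = nn.RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.mlp = SwiGLU(dim, hidden_dim=int(dim * 8 / 3))
        self.num_heads = num_heads

    def forward(self, x):                        # x: (B, N, dim) patch tokens
        b, n, d = x.shape
        h = self.num_heads
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q = q.view(b, n, h, d // h).transpose(1, 2)
        k = k.view(b, n, h, d // h).transpose(1, 2)
        v = v.view(b, n, h, d // h).transpose(1, 2)
        q, k = rope_rotate(q), rope_rotate(k)    # the only positional signal in the block
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, n, d))
        x = x + self.mlp(self.norm2(x))
        return x


# Patchify a 224x224 image into 16x16 non-overlapping patches and run one block.
patch_embed = nn.Conv2d(3, 384, kernel_size=16, stride=16)
tokens = patch_embed(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)  # (1, 196, 384)
out = VisionLLaMABlock()(tokens)
```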
Two versions of VisionLLaMA were examined: a plain and a pyramid transformer. The former follows the ViT design, while the latter explores the extension to window-based, hierarchical transformers. Rather than constructing new pyramid transformers, the goal was to show how VisionLLaMA adapts to existing designs. Experiments evaluated VisionLLaMA on diverse vision tasks, including image generation, classification, segmentation, and detection. The results indicate that VisionLLaMA consistently outperforms the vision transformers it is compared against across model sizes, demonstrating its efficacy as a vision backbone.
VisionLLaMA's design choices, such as SwiGLU, the normalization technique, the positional-encoding ratio, and the feature abstraction method, were investigated in ablation studies, providing insight into its reliability and efficiency. The paper also suggests extending VisionLLaMA beyond text and vision to create a more comprehensive and adaptable model architecture.
The experiments compared supervised and self-supervised training, and the resulting models were fine-tuned on downstream tasks. Factors contributing to VisionLLaMA's strong performance, such as its positional encoding technique and the flexibility provided by RoPE, were analyzed. The paper advocates the use of VisionLLaMA in vision tasks and encourages further investigation and research into its capabilities.
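One reason RoPE is flexible is that its rotation angles are computed on the fly from token positions, so no fixed-size learned positional table has to be stored or interpolated when the input resolution, and hence the number of patches, changes. The short sketch below illustrates this point; the helper name `rope_angles` and the patch counts are hypothetical examples, not taken from the paper.

```python
# Hedged illustration: RoPE angles are a function of position, not a learned table.
import torch

def rope_angles(num_tokens, head_dim, base=10000.0):
    half = head_dim // 2
    freqs = 1.0 / (base ** (torch.arange(half).float() / half))
    return torch.arange(num_tokens).float()[:, None] * freqs[None, :]  # (N, half)

# 224x224 input with 16x16 patches -> 196 tokens; 448x448 input -> 784 tokens.
# The same formula covers both; nothing is retrained or resized.
print(rope_angles(196, 64).shape, rope_angles(784, 64).shape)
```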
In conclusion, VisionLLaMA provides a versatile architecture that bridges language and vision modalities and has a considerable impact on vision tasks. Lastly, the paper's open-source release promotes collaboration and creativity in the field of large vision transformers.