
Researchers at Apple have proposed MobileCLIP, a new family of image-text models optimized for runtime performance through multi-modal reinforced training.

In the realm of multi-modal learning, large image-text foundation models have shown remarkable zero-shot performance and improved robustness across a wide range of downstream tasks. Models such as Contrastive Language-Image Pretraining (CLIP) have notably advanced multi-modal AI thanks to their ability to jointly reason over images and text. A variety of architectures have recently been shown to handle vision tasks efficiently on resource-constrained devices; examples include pruning ViT architectures to derive smaller, faster CLIP models.
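To make the zero-shot capability concrete, the sketch below shows the basic CLIP-style recipe: embed an image and a set of text prompts into a shared space, then classify by cosine similarity. The encoders here are stand-in placeholder modules, not the actual CLIP or MobileCLIP networks, so the output is illustrative only.

```python
# Minimal sketch of CLIP-style zero-shot classification.
# The encoders below are placeholders, not the real CLIP/MobileCLIP architectures.
import torch
import torch.nn.functional as F

class DummyImageEncoder(torch.nn.Module):
    """Placeholder image encoder mapping an image tensor to an embedding."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 224 * 224, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(images.flatten(1))

class DummyTextEncoder(torch.nn.Module):
    """Placeholder text encoder mapping token ids to an embedding."""
    def __init__(self, vocab_size: int = 49408, embed_dim: int = 512):
        super().__init__()
        self.embed = torch.nn.EmbeddingBag(vocab_size, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)

@torch.no_grad()
def zero_shot_classify(image, prompt_tokens, image_encoder, text_encoder):
    # Embed the image and every class prompt, then L2-normalize.
    img = F.normalize(image_encoder(image), dim=-1)         # (1, D)
    txt = F.normalize(text_encoder(prompt_tokens), dim=-1)  # (C, D)
    # Cosine similarity acts as the classification logits.
    logits = 100.0 * img @ txt.T                            # (1, C)
    return logits.softmax(dim=-1)

# Usage with random data, for illustration only.
img_enc, txt_enc = DummyImageEncoder(), DummyTextEncoder()
image = torch.randn(1, 3, 224, 224)
prompts = torch.randint(0, 49408, (3, 16))  # e.g. tokenized "a photo of a {cat,dog,car}"
print(zero_shot_classify(image, prompts, img_enc, txt_enc))
```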

However, deploying these models, particularly on mobile devices, is challenging because the large transformer-based encoders that CLIP employs carry significant memory and latency overheads. This leads to the first of two key issues addressed by the paper: the trade-off between runtime performance and accuracy across different architectures is difficult to study, since the extensive training required by CLIP models is costly and hinders rapid development and evaluation of architectural designs on large-scale image-text datasets.

The second problem stems from the reduced capacity of smaller architectures, which leads to lower accuracy. To address these issues, researchers at Apple have developed MobileCLIP, a new family of image-text models optimized for runtime performance through multi-modal reinforced training. MobileCLIP sets a new state-of-the-art trade-off between latency and accuracy for zero-shot classification and retrieval tasks across numerous datasets. The training method improves the accuracy of efficient models by transferring knowledge from an image captioning model and an ensemble of strong pre-trained CLIP encoders.
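One way to picture this kind of training is as a standard CLIP contrastive loss augmented with a distillation term that pushes the student's image-text similarity matrix toward a stronger teacher's. The sketch below is a simplified illustration under that assumption; the loss weighting, temperature values, and exact formulation are not taken from the paper.

```python
# Simplified sketch of combining the CLIP contrastive loss with image-text
# alignment distillation from a stronger teacher. Weights and temperatures
# are illustrative, not the paper's exact objective.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric InfoNCE loss over a batch of image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature          # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def alignment_distillation_loss(student_img, student_txt,
                                teacher_img, teacher_txt, temperature=0.07):
    """KL divergence between teacher and student image-to-text similarity matrices."""
    s = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).T
    t = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).T
    return F.kl_div(F.log_softmax(s / temperature, dim=-1),
                    F.softmax(t / temperature, dim=-1),
                    reduction="batchmean")

def reinforced_training_loss(student_img, student_txt,
                             teacher_img, teacher_txt, lam=0.7):
    """Convex combination of the two terms; lam is an illustrative hyperparameter."""
    return (lam * clip_contrastive_loss(student_img, student_txt) +
            (1.0 - lam) * alignment_distillation_loss(student_img, student_txt,
                                                      teacher_img, teacher_txt))
```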

Specifically, this multi-modal reinforced training approach is combined with DataCompDR, a reinforced dataset, to overcome the challenges outlined above. By storing synthetic captions and teacher embeddings in the dataset as part of a dataset reinforcement strategy, accuracy is improved without incurring extra training-time compute. The method leverages knowledge from an image captioning model through synthetic captions and distills image-text alignments from a range of strong pre-trained CLIP models.
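The sketch below illustrates what such a reinforced training sample could look like: the original image-text pair stored alongside synthetic captions and precomputed teacher embeddings, so the expensive teacher models never need to run during student training. The field names and tensor shapes are hypothetical and do not reflect the actual DataCompDR file format.

```python
# Hypothetical layout of a "reinforced" sample: original pair plus synthetic
# captions and precomputed teacher embeddings. Not the DataCompDR schema.
from dataclasses import dataclass
from typing import List
import torch

@dataclass
class ReinforcedSample:
    image_path: str                         # original web image
    original_caption: str                   # original alt-text caption
    synthetic_captions: List[str]           # captions from a pretrained captioner
    teacher_image_embeddings: torch.Tensor  # (num_teachers, num_augmentations, D)
    teacher_text_embeddings: torch.Tensor   # (num_teachers, num_captions, D)

def distillation_targets(sample: ReinforcedSample, teacher_idx: int = 0):
    """Return stored teacher embeddings, so no teacher forward pass is needed at training time."""
    return (sample.teacher_image_embeddings[teacher_idx],
            sample.teacher_text_embeddings[teacher_idx])

# Usage with dummy values.
sample = ReinforcedSample(
    image_path="images/000001.jpg",
    original_caption="a dog playing in the park",
    synthetic_captions=["a brown dog runs across green grass"],
    teacher_image_embeddings=torch.randn(2, 4, 512),
    teacher_text_embeddings=torch.randn(2, 2, 512),
)
teacher_img, teacher_txt = distillation_targets(sample)
```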

Three compact MobileCLIP variants have been created, with the fastest, MobileCLIP-S0, proving to be roughly five times faster and three times smaller than a standard ViT-B/16 CLIP model while delivering similar average accuracy. Moreover, multi-modal reinforced training yields an average performance improvement of +2.9% across 38 evaluation benchmarks when training a ViT-B/16 image backbone. The quality of web-sourced data is further improved by building on DataComp and data filtering networks.
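For readers who want to reproduce this kind of speed comparison on their own hardware, the snippet below is a generic latency-measurement sketch. The two models are arbitrary stand-in encoders, and the timing method is a simple CPU wall-clock loop rather than the on-device benchmarking protocol used in the paper.

```python
# Rough latency-measurement sketch for comparing two image encoders.
# The models are placeholders; the paper's figures come from on-device benchmarks.
import time
import torch

@torch.no_grad()
def measure_latency_ms(model: torch.nn.Module, input_size=(1, 3, 224, 224),
                       warmup: int = 10, iters: int = 50) -> float:
    model.eval()
    x = torch.randn(*input_size)
    for _ in range(warmup):      # warm-up runs are excluded from timing
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters * 1000.0

# Example: compare two stand-in encoders of different widths.
small = torch.nn.Sequential(torch.nn.Conv2d(3, 32, 3, stride=2),
                            torch.nn.AdaptiveAvgPool2d(1),
                            torch.nn.Flatten(), torch.nn.Linear(32, 512))
large = torch.nn.Sequential(torch.nn.Conv2d(3, 256, 3, stride=2),
                            torch.nn.AdaptiveAvgPool2d(1),
                            torch.nn.Flatten(), torch.nn.Linear(256, 512))
print(f"small: {measure_latency_ms(small):.2f} ms, large: {measure_latency_ms(large):.2f} ms")
```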

In summary, MobileCLIP introduces a new family of efficient image-text models optimized for runtime performance via multi-modal reinforced training. Using DataCompDR, a reinforced training dataset that draws on knowledge from a pretrained image captioning model and an ensemble of strong CLIP models, MobileCLIP achieves a state-of-the-art balance between speed and accuracy across multiple datasets. This research marks a notable step toward optimizing the runtime performance of machine-learning models, particularly on resource-constrained devices.
