Transformers have a broad range of applications, spanning text classification, map construction, object detection, point cloud analysis, and audio spectrogram recognition. They are also competent in multimodal tasks, as demonstrated by CLIP’s use of image-text pairs for enhanced image recognition. This reflects the effectiveness of transformers as universal sequence-to-sequence models whose embeddings provide a unified representation of data across modalities.
A methodology exemplified by CLIP uses data from one modality (text) to improve model performance in another (images). Nonetheless, this approach depends on relevant, paired samples from the two modalities, a requirement that is often overlooked and constitutes a considerable limitation. For example, training with image-audio pairs could plausibly enhance image recognition, but it remains uncertain whether a standalone audio dataset, with no meaningful links between its audio samples and any image samples, could improve ImageNet classification.
A collaboration between The Chinese University of Hong Kong and Tencent AI Lab has led to the proposal of the Multimodal Pathway Transformer (M2PT). The goal is to improve a transformer trained on a specific modality, such as an image model trained on ImageNet, by exploiting irrelevant data from other modalities such as audio or point clouds. By connecting transformers of differing modalities, even though the data samples from the target and auxiliary modalities are intentionally unrelated, this method shows significant and consistent performance enhancements across image, point cloud, video, and audio recognition tasks.
Through the use of pathways, M2PT connects components of a target-modality model with those of an auxiliary model, so that target-modality data is processed by both. The approach builds on the transformer’s strong sequence-to-sequence modeling capabilities in both modalities: each model keeps its modality-specific tokenizer and task-specific head, while the auxiliary model’s transformer blocks are integrated through Cross-Modal Re-parameterization, which exploits the extra weights at no additional inference cost.
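The re-parameterization idea is easiest to see in code. Below is a minimal PyTorch sketch of how one linear layer of the target model could be re-parameterized with the corresponding auxiliary-modality weight through a learnable scale, and then merged back into an ordinary layer before deployment so inference cost is unchanged. The names (CrossModalLinear, lambda_aux, merge_for_inference) and the zero initialization of the scale are illustrative assumptions, not the authors’ released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalLinear(nn.Module):
    """A target-modality linear layer whose effective weight borrows a frozen
    auxiliary-modality weight through a learnable scale (illustrative sketch)."""

    def __init__(self, target_linear: nn.Linear, aux_linear: nn.Linear):
        super().__init__()
        self.target = target_linear  # trainable weights from the target modality
        # Frozen weight from the auxiliary modality (same shape, e.g. both ViT-B).
        self.register_buffer("aux_weight", aux_linear.weight.detach().clone())
        # Learnable scaling factor; initialized at zero so training starts
        # from the original target model (an assumption for this sketch).
        self.lambda_aux = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Training-time view: effective weight = W_target + lambda * W_aux.
        weight = self.target.weight + self.lambda_aux * self.aux_weight
        return F.linear(x, weight, self.target.bias)

    @torch.no_grad()
    def merge_for_inference(self) -> nn.Linear:
        # Fold the auxiliary weight back into a plain linear layer, so the
        # deployed model has exactly the original structure and cost.
        merged = nn.Linear(
            self.target.in_features,
            self.target.out_features,
            bias=self.target.bias is not None,
        )
        merged.weight.copy_(self.target.weight + self.lambda_aux * self.aux_weight)
        if self.target.bias is not None:
            merged.bias.copy_(self.target.bias)
        return merged


# Example usage: wrap one projection of a target block with an auxiliary weight.
target_proj = nn.Linear(768, 768)   # e.g. a projection in an image (ViT-B) block
aux_proj = nn.Linear(768, 768)      # counterpart from an audio/point-cloud model
layer = CrossModalLinear(target_proj, aux_proj)
y = layer(torch.randn(4, 196, 768))     # training-time forward pass
plain = layer.merge_for_inference()     # inference-time layer, zero extra cost
```

Because the merged layer is an ordinary nn.Linear, the auxiliary weights add capacity during training but disappear as a separate computation at inference time.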
In image recognition experiments with the ViT-B architecture, the researchers observed considerable improvements in accuracy and task performance on ImageNet, MS COCO, and ADE20K. Strikingly, M2PT surpassed the baseline models on the APbox and APmask metrics (MS COCO) and on mIoU (ADE20K) even when the auxiliary weights came from a point cloud model.
The paper introduces the Multimodal Pathway as a technique for improving transformer performance on a specific modality by incorporating irrelevant data from other modalities, together with Cross-Modal Re-parameterization as a concrete implementation that exploits auxiliary weights at no extra inference cost. The methodology consistently delivers significant performance improvements across recognition tasks, underscoring the value of seemingly irrelevant data from other modalities in transformer-based models.
The full paper and related research materials are available on GitHub, with due credit to the associated researchers.