Researchers from New York University, Genentech, and CIFAR have proposed a new framework, Inter- & Intra-Modality Modeling (I2M2), to address inconsistencies in supervised multi-modal learning. Multi-modal learning is a critical facet of machine learning, used in autonomous vehicles, healthcare, and robotics, among other fields, where data from several modalities is mapped to a target label. The effectiveness of this approach, however, varies with the task: multi-modal models sometimes outperform, sometimes match, and sometimes underperform models trained on a single modality.
The team introduces a principled approach to multi-modal learning that explains these performance differences and leads to better models for exploiting multi-modal data. The approach takes a probabilistic perspective: it posits a generative mechanism for the data and analyzes the multi-modal learning problem through it, emphasizing that the strength of the dependencies between modalities and the label varies across datasets.
The I2M2 method builds on the multi-modal generative model, a commonly used formulation in multi-modal learning. Prior work in this area falls into two groups: inter-modality modeling and intra-modality modeling. Inter-modality methods focus on capturing cross-modal relationships to predict the target, but they often fail when their assumptions about the data-generating process do not hold. Intra-modality methods, in contrast, model each modality separately, so any dependence between the modalities is captured only through the label, which limits their effectiveness.
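To make the distinction concrete, the sketch below contrasts the two families of baselines on a hypothetical two-modality classification task. The class names, layer sizes, and PyTorch architectures are illustrative assumptions, not the authors' implementation: the intra-modality model scores each modality on its own, while the inter-modality model fuses both inputs to capture cross-modal interactions.

```python
import torch
import torch.nn as nn

class IntraModalityClassifier(nn.Module):
    """Scores the label from a single modality, ignoring the other one."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # logits from one modality alone


class InterModalityClassifier(nn.Module):
    """Fuses both modalities jointly, aiming to capture cross-modal interactions."""
    def __init__(self, in_dim_a: int, in_dim_b: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim_a + in_dim_b, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_a, x_b], dim=-1))  # logits from the fused input
```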
The I2M2 method addresses these shortcomings by providing a framework that explicitly captures both inter- and intra-modality dependencies. This allows it to adapt to different settings while still delivering strong performance. Initial results demonstrate I2M2's potential in healthcare, including automated diagnosis from knee MRI scans and mortality and ICD-9 code prediction on the MIMIC-III dataset.
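One plausible reading of this framework, assuming a simple additive ensemble of logits (the paper derives the precise combination from its generative model), is a classifier that adds the per-modality predictions to a joint cross-modal prediction. The sketch below reuses the illustrative classes defined above; it is a minimal sketch under those assumptions, not the authors' implementation.

```python
class I2M2Classifier(nn.Module):
    """Combines per-modality (intra) and joint (inter) predictions.

    Assumption: the combination here is a simple additive ensemble of logits;
    the exact form in the paper follows from its generative model.
    """
    def __init__(self, in_dim_a: int, in_dim_b: int, num_classes: int):
        super().__init__()
        self.intra_a = IntraModalityClassifier(in_dim_a, num_classes)
        self.intra_b = IntraModalityClassifier(in_dim_b, num_classes)
        self.inter = InterModalityClassifier(in_dim_a, in_dim_b, num_classes)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Intra-modality terms plus an inter-modality term.
        return self.intra_a(x_a) + self.intra_b(x_b) + self.inter(x_a, x_b)


# Toy usage: random features stand in for two modalities of a 3-class problem.
model = I2M2Classifier(in_dim_a=64, in_dim_b=32, num_classes=3)
x_a, x_b = torch.randn(8, 64), torch.randn(8, 32)
logits = model(x_a, x_b)                                    # shape: (8, 3)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
```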
The research confirmed that these dependencies vary in strength across datasets: the fastMRI dataset benefits more from intra-modality dependencies, while the NLVR2 dataset relies more on inter-modality dependencies. Regardless of which type of dependency dominates, I2M2 performed well in all settings, demonstrating its robustness and adaptability and establishing confidence in its effectiveness for multi-modal learning. The research paper is available for review.