Researchers from New York University, Genentech, and CIFAR are pioneering a new approach to multi-modal learning in an attempt to improve its efficacy. Multi-modal learning uses data from several distinct sources, or modalities, to predict a target label. This type of learning is common in fields like healthcare, autonomous vehicles, and robotics, but its success can vary greatly depending on the task and dataset at hand.
The researchers’ approach, called inter- and intra-modality modeling (I2M2), models both kinds of structure in the data explicitly. It trains a classifier for each modality to capture the dependencies within that modality, along with a classifier that captures the relationship between the output label and the interactions across modalities, and combines their predictions.
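As a rough illustration, the combination might look like the sketch below. The module names, the two-modality setup, and the additive fusion of logits are assumptions made for illustration, not the authors’ exact formulation:

```python
import torch
import torch.nn as nn

class I2M2Sketch(nn.Module):
    """Illustrative sketch: one classifier per modality (intra-modality)
    plus a joint classifier over both modalities (inter-modality).
    Summing the logits is an assumed fusion rule, not the paper's exact one."""

    def __init__(self, dim_a: int, dim_b: int, num_classes: int, hidden: int = 128):
        super().__init__()
        # Intra-modality classifiers: each sees only its own modality.
        self.clf_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
        self.clf_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
        # Inter-modality classifier: sees both modalities jointly,
        # so it can capture cross-modal interactions.
        self.clf_joint = nn.Sequential(nn.Linear(dim_a + dim_b, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        logits_a = self.clf_a(x_a)
        logits_b = self.clf_b(x_b)
        logits_joint = self.clf_joint(torch.cat([x_a, x_b], dim=-1))
        # Combine intra- and inter-modality evidence.
        return logits_a + logits_b + logits_joint

# Usage: two toy modalities with 32- and 64-dimensional features, 3 classes.
model = I2M2Sketch(dim_a=32, dim_b=64, num_classes=3)
logits = model(torch.randn(8, 32), torch.randn(8, 64))
print(logits.shape)  # torch.Size([8, 3])
```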
Previously, multi-modal learning research fell into one of two categories. Inter-modality modeling relies heavily on interactions between modalities to predict the target, but it often falls short when its assumptions about how the labels are generated do not hold. Intra-modality modeling, by contrast, models each modality on its own and therefore neglects the interactions between modalities, which limits its effectiveness (the two regimes are contrasted in the sketch below). The researchers’ new method does not require knowing in advance which of these dependencies matter; instead, it explicitly models dependencies both within and across modalities, adapting to different tasks and datasets.
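For contrast, here is a minimal sketch of the two earlier regimes, again with hypothetical two-modality classifiers and toy dimensions chosen only for illustration:

```python
import torch
import torch.nn as nn

x_a, x_b = torch.randn(8, 32), torch.randn(8, 64)  # toy features for two modalities

# Intra-modality modeling: independent classifiers, one per modality;
# cross-modal interactions are never represented.
clf_a, clf_b = nn.Linear(32, 3), nn.Linear(64, 3)
intra_logits = clf_a(x_a) + clf_b(x_b)

# Inter-modality modeling: a single joint classifier over both modalities,
# which pays off only when cross-modal interactions truly drive the label.
clf_joint = nn.Linear(32 + 64, 3)
inter_logits = clf_joint(torch.cat([x_a, x_b], dim=-1))
```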
The team tested the I2M2 method on a range of datasets, including automatic diagnosis from knee MRI scans, prediction of mortality and ICD-9 codes in MIMIC-III, and vision-and-language tasks such as NLVR2 and VQA. The method outperformed both intra- and inter-modality approaches. Importantly, I2M2 performed strongly regardless of the relative importance of inter- or intra-modality dependencies for a given task.
The results suggest that success in multi-modal learning depends on integrating both inter- and intra-modality dependencies. The I2M2 method advances current understanding of multi-modal learning and offers a more adaptable approach for future model development. The findings lay the foundation for further investigation into a model-agnostic framework for multi-modal learning.