Google Cloud AI researchers have unveiled LANISTR, a novel pre-training framework designed to handle both structured and unstructured data effectively and efficiently. LANISTR, which stands for Language, Image, and Structured Data Transformer, addresses a key issue in machine learning: handling multimodal data (language, images, and structured data) when some modalities are missing, a common situation in large-scale, unlabeled structured data such as tables and time series.
Modern multimodal pre-training methods often depend on all data types being available for both training and inference, which makes them unsuitable when data is missing or incomplete. In such cases, the fusion techniques these methods typically rely on can yield suboptimal results. LANISTR offers a solution: unimodal and multimodal masking strategies combined into a robust pre-training objective that tolerates absent modalities.
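To make the problem concrete, one common way to let a fusion model tolerate an absent modality is to substitute a learned placeholder embedding before fusion. The sketch below (PyTorch, with hypothetical names such as `MultimodalFusion` and `missing_embed`) illustrates that general idea; it is not LANISTR's published architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Toy fusion module that tolerates missing modalities.

    Illustrative sketch: each modality is pre-encoded to a shared
    dimension, and an absent modality is replaced by a learned
    placeholder embedding so the fusion transformer always sees a
    fixed-size input. Not LANISTR's exact mechanism.
    """

    def __init__(self, dim: int = 256, n_modalities: int = 3):
        super().__init__()
        # One learned "missing" token per modality (text, image, structured).
        self.missing_embed = nn.Parameter(torch.zeros(n_modalities, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)

    def forward(self, feats: list[torch.Tensor | None]) -> torch.Tensor:
        # feats[i] is a (batch, dim) tensor, or None if modality i is missing.
        batch = next(f.shape[0] for f in feats if f is not None)
        tokens = []
        for i, f in enumerate(feats):
            if f is None:
                # Substitute the learned placeholder for the absent modality.
                f = self.missing_embed[i].expand(batch, -1)
            tokens.append(f.unsqueeze(1))              # (batch, 1, dim)
        fused = self.fusion(torch.cat(tokens, dim=1))  # (batch, n_mod, dim)
        return fused.mean(dim=1)                       # pooled joint embedding
```

A placeholder alone only keeps the model running when data is absent; LANISTR's contribution is the pre-training objective, described next, that teaches the model to produce useful representations in that situation.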
LANISTR's unimodal masking masks portions of the data within each modality during training, forcing the model to learn intra-modality contextual relationships. Multimodal masking, by contrast, masks out entire modalities, training the model to infer the missing modality from the available ones. This is driven by a similarity-based objective that encourages the representation of an input with a masked-out modality to stay close to the representation of the full input.
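As a rough illustration of how these two objectives might fit together, here is a minimal PyTorch sketch assuming a joint encoder `encode` that maps the three modalities to a single embedding. The masking rate, the `MASK_ID` constant, the choice of which modality to drop, and the cosine-similarity loss are all illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token

def unimodal_mask(tokens: torch.Tensor, rate: float = 0.15):
    """Randomly mask token positions within one modality (MLM-style).

    Returns the corrupted tokens and a boolean mask marking which
    positions the model must reconstruct from intra-modality context.
    """
    mask = torch.rand_like(tokens, dtype=torch.float) < rate
    corrupted = tokens.masked_fill(mask, MASK_ID)
    return corrupted, mask

def multimodal_similarity_loss(encode, text, image, structured):
    """Similarity-based multimodal masking objective (sketch).

    Drop one whole modality, then pull the embedding of the partial
    input toward the embedding of the full input, so representations
    stay consistent when a modality is absent at inference time.
    """
    with torch.no_grad():
        target = encode(text, image, structured)  # full-input embedding
    # Mask out the image modality entirely (in practice, sampled at random).
    pred = encode(text, None, structured)
    # Negative cosine similarity: minimized when pred aligns with target.
    return -F.cosine_similarity(pred, target, dim=-1).mean()
```

The stop-gradient on the full-input embedding follows a SimSiam-style recipe for avoiding representational collapse; whether LANISTR uses exactly this trick is an assumption here. The full pre-training loss would then combine the reconstruction losses from unimodal masking with this similarity term, weighted by a hyperparameter.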
LANISTR’s effectiveness was evaluated on two real-world datasets, from the healthcare and retail sectors respectively: MIMIC-IV and the Amazon Product Review dataset. The framework handled out-of-distribution scenarios well and showed substantial gains in accuracy and generalization, even with limited labeled data.
Overall, LANISTR addresses the critical problem of missing modalities in large-scale unlabeled multimodal datasets. By combining unimodal and multimodal masking strategies with a similarity-based multimodal masking objective, it achieves robust and efficient pre-training. Crucially, it can learn effectively from incomplete data and generalize well to unseen data distributions, making it a valuable tool for advancing multimodal learning. The team’s research paper with more detailed information is available online, and interested readers are invited to join their online community channels for updates and discussion.