Recent multimodal foundation models are often limited in their ability to fuse various modalities, as they typically utilize distinct encoders or decoders for each modality. This structure limits their capability to effectively integrate varied content types and create multimodal documents with interwoven sequences of images and text.
Meta researchers, in response to this limitation, have…
