Integrating multiple generative foundation models offers an efficient way to produce outputs across modalities such as text, speech, and images by leveraging each model's specific strengths. However, the success of this integration depends heavily on how well data is aligned across modalities and on how unimodal representations are reused in cross-domain generative tasks.
To tackle this challenge, Google DeepMind researchers have proposed a new architecture, Zipper. Current practice in multimodal generative modeling typically involves pre-training models with vocabulary expansion techniques or fine-tuning them on aligned multimodal data. These approaches have notable limitations: they are inflexible when it comes to adding new modalities after pre-training, and they require massive amounts of aligned cross-modal data.
Zipper takes a different approach: it starts from independently pre-trained unimodal decoders and assembles them using cross-attention mechanisms. This allows pre-trained decoders to be flexibly reused and repurposed while preserving their unimodal performance.
The Zipper architecture is composed of multiple autoregressive decoder towers, each independently pre-trained on a single modality with next-token prediction. The decoders are then combined through gated cross-attention layers, which allow information to flow between modalities at regular intervals. Projection layers inside the cross-attention handle differences in embedding dimension and map representations from one modality into the other's space.
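The sketch below illustrates this idea in PyTorch: a gated cross-attention block through which one frozen decoder (here, a text backbone) attends to hidden states produced by another frozen decoder (here, a speech backbone), with a projection layer bridging the mismatched embedding sizes. Class names, dimensions, and the zero-initialized gate are illustrative assumptions, not DeepMind's released implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Hypothetical block inserted between layers of one decoder tower
    so it can attend to the hidden states of the other tower."""

    def __init__(self, query_dim: int, context_dim: int, num_heads: int = 8):
        super().__init__()
        # Projection layer: maps the other modality's representations
        # into the query decoder's embedding space.
        self.context_proj = nn.Linear(context_dim, query_dim)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=num_heads, batch_first=True
        )
        self.norm = nn.LayerNorm(query_dim)
        # Gate initialized to zero so the combined model initially behaves
        # exactly like the original unimodal decoder.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, query_len, query_dim)   hidden states of decoder A
        # context: (batch, ctx_len, context_dim)   hidden states of decoder B
        ctx = self.context_proj(context)
        attended, _ = self.cross_attn(query=self.norm(x), key=ctx, value=ctx)
        # Gated residual connection: tanh(gate) scales the cross-modal signal.
        return x + torch.tanh(self.gate) * attended


# Usage sketch: both pre-trained backbones stay frozen; only the new
# cross-attention and projection parameters are trained on aligned data.
text_hidden = torch.randn(2, 16, 1024)    # from a frozen text decoder layer
speech_hidden = torch.randn(2, 50, 768)   # from a frozen speech decoder layer

block = GatedCrossAttentionBlock(query_dim=1024, context_dim=768)
fused = block(text_hidden, speech_hidden)  # shape: (2, 16, 1024)
```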
To evaluate Zipper, experiments were conducted using variants of the PaLM2 models as the text backbone and a similarly sized model, pre-trained on the LibriLight dataset, as the speech backbone. Zipper preserved unimodal performance, demonstrated that cross-attention aligns the modalities effectively, and required significantly less aligned data to reach competitive cross-modal results.
Zipper is presented as a scalable way to integrate unimodal decoders, making modality composition efficient even when aligned data is scarce. It maintains strong unimodal performance while achieving competitive results on cross-modal tasks, and it points toward future research on combining additional modalities.