Generating high-quality, diverse media from text remains difficult for existing models. They tend to produce low-quality results, run slowly, or demand substantial compute. Current systems also handle only one task at a time, such as text-to-image or text-to-video generation, so they must be combined with other models to cover several modalities. The resulting resource requirements keep them out of reach for widespread use.
Lumina-T2X addresses these hurdles with a family of diffusion transformers that turn text into several kinds of media: images, videos, multi-view 3D images, and synthesized speech. Its core component, the Flow-based Large Diffusion Transformer (Flag-DiT), scales to 7 billion parameters and handles sequences of up to 128,000 tokens. The model maps the different media types into a single token space, which lets it generate outputs of any duration, aspect ratio, and resolution.
A distinctive feature of Lumina-T2X is that it encodes any modality, whether an image, a video, a 3D object view, or a speech spectrogram, into a 1-D token sequence. It inserts learnable placeholder tokens, such as [nextline] and [nextframe], that mark where rows and frames end within that sequence. Because the layout is carried by the sequence itself, the model can generate images and videos at resolutions it never saw during training while keeping output quality high.
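To make the 1-D encoding concrete, here is a minimal sketch of how a grid of patches might be flattened with row and frame markers. It is an illustration under stated assumptions, not the model's actual tokenizer: the real placeholders are learnable embeddings in latent space, whereas this sketch uses plain strings, and the helper names `flatten_frames` and `make_frame`, as well as the exact marker placement, are hypothetical.

```python
from typing import List, Sequence

# String stand-ins for the learnable placeholder tokens (assumption:
# the real model uses learned embeddings, not literal strings).
NEXTLINE = "[nextline]"    # marks the end of one row of patches
NEXTFRAME = "[nextframe]"  # marks the end of one frame


def flatten_frames(frames: Sequence[Sequence[Sequence[str]]]) -> List[str]:
    """Flatten frames -> rows -> patch tokens into one 1-D sequence."""
    sequence: List[str] = []
    for frame in frames:
        for row in frame:
            sequence.extend(row)
            sequence.append(NEXTLINE)   # the row boundary survives flattening
        sequence.append(NEXTFRAME)      # the frame boundary survives flattening
    return sequence


def make_frame(height: int, width: int, tag: str) -> List[List[str]]:
    """Build a hypothetical frame of labelled patch tokens."""
    return [[f"{tag}{r}{c}" for c in range(width)] for r in range(height)]


# A single 2x3 image is treated here as a one-frame video.
image_2x3 = flatten_frames([make_frame(2, 3, "p")])

# A two-frame clip with one row of two patches per frame.
video_2_frames = flatten_frames([make_frame(1, 2, "a"), make_frame(1, 2, "b")])

print(image_2x3)
print(video_2_frames)
```

Since the width and height are recoverable from the marker positions alone, the same flattening applies unchanged to any aspect ratio or clip length, which is the property the paragraph above relies on.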
Lumina-T2X also converges faster and trains more stably, thanks to techniques such as RoPE, RMSNorm, and KQ-norm. It is designed to reduce compute requirements without compromising performance. For instance, Lumina-T2I, configured with a 5B Flag-DiT and a 7B LLaMA text encoder, requires only 35% of the training compute of a 600M-parameter naive DiT (PixArt-α). Despite this efficiency, the model generates high-resolution images and coherent videos when trained on carefully curated text-image and text-video pairs.
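As a rough illustration of why these normalization techniques help stability, the sketch below implements RMSNorm and applies it to a query and a key before computing one attention logit. It is a toy in pure Python under stated assumptions: using RMSNorm as the query/key normalization is an assumption of this sketch, the function names `rms_norm` and `attention_logit` are hypothetical, and the learnable per-channel gain is reduced to a single scalar.

```python
import math
from typing import List


def rms_norm(x: List[float], gain: float = 1.0, eps: float = 1e-6) -> List[float]:
    """Rescale x so its root-mean-square is approximately `gain`."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = gain / math.sqrt(mean_square + eps)
    return [v * scale for v in x]


def attention_logit(q: List[float], k: List[float], use_kq_norm: bool) -> float:
    """Scaled dot-product logit for one query/key pair.

    With use_kq_norm=True, both vectors are normalized first (assumption:
    RMSNorm stands in for whatever normalization the model applies).
    """
    if use_kq_norm:
        q, k = rms_norm(q), rms_norm(k)
    dot = sum(a * b for a, b in zip(q, k))
    return dot / math.sqrt(len(q))


# A key with an unusually large magnitude, as can occur in large models.
q = [3.0, 4.0]
k = [300.0, 400.0]

raw_logit = attention_logit(q, k, use_kq_norm=False)
normed_logit = attention_logit(q, k, use_kq_norm=True)

print(f"raw logit:    {raw_logit:.2f}")
print(f"normed logit: {normed_logit:.4f}")
```

With unit gain, the normalized logit cannot exceed the square root of the head dimension in magnitude, so a single oversized activation can no longer saturate the softmax. The raw logit, by contrast, grows with the product of the two vector norms.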
In summary, Lumina-T2X offers a robust and efficient approach to generating diverse media from text descriptions. It brings multiple modalities into a single framework and pairs that framework with training techniques that address the shortcomings of existing models. Because it produces high-quality outputs with lower computational demands, it is a useful tool for a range of media-generation applications.