Researchers from HyperGAI have developed a ground-breaking family of multimodal large language models (LLMs) known as Hyper Pretrained Transformers (HPT) that can proficiently and seamlessly process a wide array of input modalities, such as text, images, and videos. Existing LLMs, like GPT-4V and Gemini Pro, have limitations in comprehending multimodal data, which hinders progress towards achieving Artificial General Intelligence (AGI). HPT overcomes these limitations and delivers efficient performance across diverse input formats without a significant increase in computational cost.
Unlike traditional LLMs that focus primarily on processing text, HPT uses a multimodal pretraining framework for training large models that understand multiple modalities. It comes in two versions: HPT Pro, designed for complex multimodal tasks, and HPT Air, which efficiently caters to a broad spectrum of tasks. The HPT model also features an innovative component known as the H-Former, which bridges the gap between vision and language modalities by converting visual data into a representation the language model can understand.
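Conceptually, a bridging module of this kind maps the output of a visual encoder into tokens that sit in the language model's embedding space. The article does not describe HyperGAI's implementation, so the sketch below is only an illustration of that role; the class name, dimensions, and projection layer are assumptions, not HPT's actual architecture.

```python
import torch
import torch.nn as nn

class VisualToLanguageBridge(nn.Module):
    """Minimal sketch of what a vision-language bridge like the H-Former does:
    turn visual-encoder features into tokens the LLM can consume alongside text.
    All names and dimensions here are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Project visual features into the LLM's embedding space.
        self.project = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from an image/video encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's token embedder
        visual_tokens = self.project(patch_features)  # "visual words" in LLM space
        # Prepend the visual tokens so the LLM attends over both modalities jointly.
        return torch.cat([visual_tokens, text_embeddings], dim=1)
```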
The H-Former uses a dual-network design to learn both global and local features, equipping the model with the ability to comprehend fine details as well as high-level, abstract information across various modalities. It effectively connects vision and language, allowing HPT to interpret visual data even when primarily trained on text.
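One plausible reading of this dual-network design is a coarse "global" path that summarizes the whole image alongside a "local" path in which learnable queries cross-attend to individual patches for fine detail. The sketch below illustrates that idea under those assumptions; it is not HyperGAI's released code, and the pooling-plus-cross-attention split is an interpretation of the article's description.

```python
import torch
import torch.nn as nn

class DualPathBridge(nn.Module):
    """Illustrative dual-path bridge: a global branch for abstract, image-level
    context and a local branch for fine-grained patch detail. The structure is
    an assumption inspired by the H-Former description, not its implementation."""

    def __init__(self, dim: int = 1024, num_queries: int = 32, llm_dim: int = 4096):
        super().__init__()
        # Global path: compress the whole image into a single abstract token.
        self.global_proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Local path: learnable queries pull fine details out of individual patches.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.local_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, dim) from a vision encoder
        global_feat = self.global_proj(patch_features.mean(dim=1, keepdim=True))   # (B, 1, dim)
        queries = self.queries.expand(patch_features.size(0), -1, -1)
        local_feat, _ = self.local_attn(queries, patch_features, patch_features)   # (B, Q, dim)
        # Fuse high-level context with fine-grained detail tokens for the LLM.
        fused = torch.cat([global_feat, local_feat], dim=1)                         # (B, 1+Q, dim)
        return self.to_llm(fused)
```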
The performance of the HPT model was tested on benchmarks such as MMBench and SEED-Image, where it significantly outperformed larger counterparts like GPT-4V and Gemini Pro, showcasing its capabilities on complex multimodal tasks. In comparison with open-source multimodal LLMs of similar or smaller sizes, HPT Air achieved state-of-the-art results on challenging benchmarks such as MMMU, underscoring its effectiveness and efficiency.
The introduction of the HPT framework therefore represents a substantial leap forward in the field of multimodal LLMs. Its H-Former design and its ability to efficiently bridge visual and language modalities have yielded performance superior to current models on various benchmarks, offering new approaches to studying and achieving strong multimodal understanding.
Credit for this work goes to the researchers at HyperGAI; further details about HPT are available on their Blog and GitHub.