Apple’s progress in developing state-of-the-art artificial intelligence (AI) models is detailed in a new research paper focused on multimodal capabilities. Titled “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training,” the paper introduces Apple’s first family of Multimodal Large Language Models (MLLMs), which show strong performance in image captioning, visual question answering, and natural language inference.
The MM1 models scale up to 30 billion parameters, reportedly around three times the size of OpenAI’s GPT-4V, the component that gives GPT-4 its vision capabilities. According to the researchers, carefully curating the mix of image-caption and other pre-training data allowed them to achieve exceptional results, particularly in few-shot learning scenarios.
A distinctive strength of the MM1 models is their ability to follow instructions that span multiple images and to reason about the complex scenes those images depict, which the researchers highlight as a key differentiator between MM1 and other MLLMs.
The models were trained on a large dataset consisting of 500M interleaved image-text documents containing a billion images and 500 billion text tokens. Thanks to this scale and diversity, the MM1 models can make impressive in-context predictions and follow custom formatting from just a few examples.
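To make the in-context claim concrete, the sketch below shows what an interleaved few-shot prompt with a custom output format could look like. The file names and the “Objects/Caption” format are illustrative assumptions, not an interface described in the paper.

```python
# Hypothetical interleaved few-shot prompt: two formatted examples establish
# both the task and the desired output format before the final query image.
few_shot_prompt = [
    {"image": "kitchen.jpg"},
    {"text": "Objects: 3. Caption: a kettle, a mug, and a toaster on a counter."},
    {"image": "park.jpg"},
    {"text": "Objects: 2. Caption: a dog chasing a frisbee on the grass."},
    {"image": "query.jpg"},
    {"text": "Objects:"},  # the model is expected to continue in the same format
]
```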
One major challenge in AI is building models that can “see” and reason, which requires a vision-language connector that translates images into a representation the language model can process alongside text. The researchers found that the specific design of this vision-language connector mattered comparatively little for the MM1 models’ performance; image resolution and the number of image tokens contributed far more heavily.
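As a rough illustration of what such a connector does, the sketch below projects image-patch features from a vision encoder into a language model’s token-embedding space. The class name, dimensions, and pooling choice are assumptions for the sketch, not details of Apple’s implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative vision-language connector: maps image-patch features
    from a vision encoder into the LLM's token-embedding space.
    All dimensions here are assumptions, not values from the MM1 paper."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096,
                 num_image_tokens: int = 144):
        super().__init__()
        # Pool the patch sequence down to a fixed number of image tokens,
        # then project each pooled feature into the LLM embedding space.
        self.pool = nn.AdaptiveAvgPool1d(num_image_tokens)
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        pooled = self.pool(patch_features.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)  # (batch, num_image_tokens, llm_dim)


# Example: 576 patch features from one image become 144 "image tokens"
# that can be interleaved with text-token embeddings before the LLM.
features = torch.randn(1, 576, 1024)
image_tokens = VisionLanguageConnector()(features)
print(image_tokens.shape)  # torch.Size([1, 144, 4096])
```

In a sketch like this, the pooling step is what controls how many image tokens reach the language model, which lines up with the paper’s observation that token count and input resolution, rather than the connector’s architecture, drive performance.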
By sharing this research, Apple has demonstrated an openness toward contributing to the broader AI community. The researchers aimed to document the process of building the MLLM and distill lessons from its development that can benefit others in the field. The published results are likely to guide other MLLM developers in their architecture and pre-training data choices.
It is currently unclear how the MM1 models will be integrated into Apple’s products. However, the capabilities shown in the published examples suggest that Siri could become significantly more capable if it gains this kind of visual understanding.
Apple’s release of the MM1 research paper marks the company’s first official disclosure in the field of multimodal LLMs and signals a promising future for AI in its products.