Mobile applications play a crucial role in day-to-day life; however, the diversity and intricacy of mobile UIs often pose challenges for accessibility and user-friendliness. Many models struggle to decode the unique aspects of UIs, such as elongated aspect ratios and densely packed elements, creating demand for specialized models that can interpret the complex world of mobile apps.
Several existing efforts, such as the RICO dataset, Pix2Struct, and ILuvUI, have aimed to address this by providing structural analysis and language-vision modelling. Models such as CogAgent and Spotlight use screen images for UI mapping, while Ferret, Shikra, and Kosmos2 enhance referring and grounding capabilities but focus mostly on natural images. MobileAgent and AppAgent have attempted to enable more intuitive interactions by utilising Multimodal Large Language Models (MLLMs) for screen navigation, though they remain reliant on external modules or predefined actions.
In an attempt to improve the understanding of and interaction with UIs, researchers at Apple have developed a new model, Ferret-UI. This model sets itself apart from existing ones by incorporating an “any resolution” capability, allowing it to adapt to various screen aspect ratios and handle fine details within UI elements. This approach offers a more nuanced understanding of mobile interfaces.
Ferret-UI uses an “any resolution” strategy to process UI screens, dividing each screen into smaller sub-images based on its aspect ratio so that fine details receive dedicated attention. It’s trained using the RICO dataset for Android and proprietary data for iPhone screens, covering tasks such as widget classification, icon recognition, and OCR, as well as grounding tasks such as finding widgets and icons. Features from these sub-images are combined with those of the full screen, giving the model both fine-grained detail and global context when understanding and interacting with mobile UIs.
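To make the “any resolution” idea concrete, here is a minimal sketch of how such an aspect-ratio-aware split might be implemented. It assumes a simple two-way division along the screen’s longer axis and keeps the full screenshot as global context; the function name, the use of PIL, and the half-and-half split are illustrative assumptions, not code from the paper.

```python
from PIL import Image

def split_for_any_resolution(screenshot: Image.Image) -> list[Image.Image]:
    """Illustrative sketch (not Apple's code): divide a UI screenshot into
    sub-images along its longer axis so small icons and text survive the
    downscaling applied before the vision encoder."""
    w, h = screenshot.size
    if h >= w:
        # Portrait screen: split into top and bottom halves.
        sub_images = [
            screenshot.crop((0, 0, w, h // 2)),
            screenshot.crop((0, h // 2, w, h)),
        ]
    else:
        # Landscape screen: split into left and right halves.
        sub_images = [
            screenshot.crop((0, 0, w // 2, h)),
            screenshot.crop((w // 2, 0, w, h)),
        ]
    # The full screenshot is kept for global context; each image would then be
    # resized and encoded separately, and the resulting visual features passed
    # to the language model together.
    return [screenshot] + sub_images
```

In practice, the choice of split (halves, grids, or finer tiles) trades off fine detail against the number of visual tokens the language model must attend to.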
This approach has proven effective, with Ferret-UI outperforming open-source UI MLLMs and GPT-4V on task-specific benchmarks. It achieved a 95% accuracy rate in icon recognition tasks, a 25% increase over the nearest competitor model. Its success rate for widget classification was 90%, exceeding GPT-4V by 30%. In finding widgets and icons, it managed 92% and 93% accuracy respectively, a significant improvement over existing models. These results provide evidence of Ferret-UI’s superior capabilities in understanding mobile UIs.
In summary, Apple’s Ferret-UI is a cutting-edge model that improves understanding of mobile UIs. Through its aspect-ratio-aware processing and comprehensive training datasets, it has achieved impressive results on task-specific performance metrics. Beyond its numerical success, Ferret-UI exemplifies the potential for more intuitive and user-friendly mobile app interactions, paving the way for future advancements in UI comprehension.