
Researchers at Alibaba Present Mobile-Agent: An Autonomous Multi-Modal Mobile Device Agent

Mobile device agents built on Multimodal Large Language Models (MLLMs) are gaining popularity thanks to rapid advances in visual comprehension. This progress makes MLLM-based agents suitable for a range of applications, including mobile device operation.

Large Language Model (LLM)-based agents have long been recognized for their task-planning abilities, but challenges remain when MLLMs are applied to mobile device operation. Although MLLMs show promise, models such as GPT-4V still lack the fine-grained visual perception needed to ground operations precisely on a screen. Earlier approaches relied on interface layout files for localization, but such files are often inaccessible, limiting their applicability.

Researchers from Beijing Jiaotong University and Alibaba Group have developed Mobile-Agent, an autonomous multi-modal mobile device agent. The agent uses visual perception tools to identify and localize visual and textual elements within an app's front-end interface. It then uses this visual context to plan and decompose complex operational tasks on its own, navigating through mobile apps step by step. Unlike prior solutions, Mobile-Agent does not rely on XML files or mobile system metadata; its vision-centric approach makes it more adaptable across different mobile operating environments.
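
To make the perceive-plan-act-reflect flow concrete, here is a minimal sketch of what such a vision-driven agent loop might look like. All class and function names (`device`, `perceiver`, `planner`, `Action`, etc.) are illustrative placeholders, not taken from the Mobile-Agent codebase:

```python
# Illustrative sketch of a vision-driven mobile agent loop.
# Names are hypothetical; the real Mobile-Agent implementation differs in detail.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str           # e.g. "open_app", "tap", "type", "back", "stop"
    argument: str = ""  # text to type, element to tap, app name, etc.

def run_agent(instruction: str, device, perceiver, planner, max_steps: int = 20):
    """Iteratively perceive the screen, plan the next operation, and act."""
    history = []  # (detected elements, action) pairs, kept for self-reflection
    for _ in range(max_steps):
        screenshot = device.capture_screenshot()
        # Perception: locate visible text (OCR) and icons on the screenshot.
        elements = perceiver.detect_elements(screenshot)
        # Planning: an MLLM chooses the next operation from the instruction,
        # the detected elements, and the history of previous operations.
        action = planner.next_action(instruction, elements, history)
        if action.kind == "stop":
            break
        device.execute(action)             # tap, type, navigate, etc.
        history.append((elements, action)) # enables reflection and error correction
    return history
```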

For localization, Mobile-Agent uses OCR tools for text and CLIP for icons. The agent can perform eight operations, including opening apps, clicking on text or icons, typing, and navigating. Through an iterative cycle of planning and reflection, Mobile-Agent completes complex tasks based on the instruction and real-time analysis of the screen.
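
As a rough illustration of this kind of localization, one could combine an off-the-shelf OCR engine with CLIP image-text similarity as sketched below. The library calls (pytesseract, Hugging Face transformers CLIP) are real, but this simplified pipeline is an assumption for illustration, not the exact tooling used in the paper:

```python
# Sketch: locating target text via OCR and selecting an icon crop with CLIP.

import pytesseract
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def locate_text(screenshot: Image.Image, target: str):
    """Return (x, y) centers of OCR boxes whose text contains `target`."""
    data = pytesseract.image_to_data(screenshot, output_type=pytesseract.Output.DICT)
    hits = []
    for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                                data["width"], data["height"]):
        if target.lower() in text.lower():
            hits.append((x + w // 2, y + h // 2))
    return hits

def pick_icon(icon_crops: list, description: str) -> int:
    """Return the index of the candidate icon crop best matching `description`."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[description], images=icon_crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, num_crops): similarity of the description
    # to each candidate crop; the best match is the agent's click target.
    return int(outputs.logits_per_text.argmax().item())
```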

The researchers also introduced Mobile-Eval, a benchmark comprising ten popular mobile apps with three instructions each, to evaluate Mobile-Agent's effectiveness. Mobile-Agent achieved completion rates of 91%, 82%, and 82% across the three instruction sets, with a Process Score of roughly 80%, suggesting it operates at about 80% of the efficiency of a human operator. A key factor behind these results is Mobile-Agent's ability to correct its own errors in real time.
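
For reference, these headline numbers reduce to simple ratios over the evaluation episodes. The sketch below shows one way such metrics could be computed; the episode fields are hypothetical, and the paper defines the exact metric formulations:

```python
# Illustrative metric computation for an evaluation like Mobile-Eval.
# Episode fields ("completed", "correct_steps", "total_steps") are assumptions.

def success_rate(episodes):
    """Fraction of tasks the agent completed end to end."""
    return sum(ep["completed"] for ep in episodes) / len(episodes)

def process_score(episodes):
    """Fraction of individual operations judged correct across all tasks."""
    correct = sum(ep["correct_steps"] for ep in episodes)
    total = sum(ep["total_steps"] for ep in episodes)
    return correct / total

episodes = [
    {"completed": True,  "correct_steps": 8, "total_steps": 9},
    {"completed": False, "correct_steps": 5, "total_steps": 8},
]
print(f"Success rate:  {success_rate(episodes):.0%}")
print(f"Process score: {process_score(episodes):.0%}")
```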

In conclusion, the researchers from Beijing Jiaotong University and Alibaba Group have developed Mobile-Agent, an autonomous multimodal agent that can operate a variety of mobile applications. By identifying and locating visual and textual elements within these applications, Mobile-Agent autonomously plans and executes tasks, and its vision-centric approach eliminates the need for system-specific customizations. Experiments demonstrate its effectiveness and efficiency, establishing it as a versatile and adaptable solution for language-agnostic interaction with mobile applications.
