Advancements in vision-language models (VLMs) have opened the possibility of a fully autonomous Artificial Intelligence (AI) assistant that performs everyday computer tasks through natural language. However, reasoning and common-sense abilities alone do not guarantee intelligent assistant behavior. A method for translating these pre-trained abilities into practical AI agents is therefore crucial.
Even today, leading VLMs such as GPT-4V and Gemini 1.5 Pro struggle with device-control tasks. The paper discussed here considers three existing lines of work that address this, along with their limitations. First, training multi-modal digital agents, which is complicated by the need to control devices directly at the pixel level. Second, building environments for device-control agents, which are typically evaluated in deterministic, unchanging settings. Lastly, Reinforcement Learning (RL) for LLMs/VLMs, which mainly targets single-turn tasks such as preference optimization and can lead to sub-optimal strategies on multi-step problems.
The paper introduces a novel autonomous RL method known as DigiRL (RL for Digital Agents), developed by researchers from UC Berkeley, UIUC, and Google DeepMind, for training device-control agents. The method produces an AI agent with state-of-the-art performance on several Android device-control tasks. Training proceeds in two stages: an initial offline RL phase initializes the agent from existing data, and an offline-to-online RL phase then fine-tunes the model on experience it collects itself. To support online RL, the researchers also built a scalable, parallelizable Android learning environment equipped with a robust general-purpose evaluator whose average error rate against human judgment is only 2.8%.
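The sketch below illustrates, in rough terms, how such a two-stage pipeline could be wired together: an offline warm-up on logged trajectories, followed by online fine-tuning in which a VLM-based evaluator assigns a sparse success reward to each fresh rollout. This is not the authors' code; all names (`Step`, `offline_phase`, `online_phase`, the `rollout`, `evaluator`, and `update` callables) are hypothetical placeholders.

```python
# Hedged sketch of an offline-then-online RL pipeline with a VLM success evaluator.
# All helper names are assumptions for illustration, not DigiRL's actual API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    screenshot: bytes    # emulator screenshot the agent acted on
    instruction: str     # natural-language task, e.g. "open the settings app"
    action: str          # serialized device-level action (tap / scroll / type)
    reward: float = 0.0  # filled in later: 1.0 on task success, 0.0 otherwise

Trajectory = List[Step]

def offline_phase(policy,
                  logged_data: List[Trajectory],
                  update: Callable[[object, Trajectory], None]) -> None:
    """Stage 1: warm-start the policy from pre-collected trajectories."""
    for traj in logged_data:
        update(policy, traj)

def online_phase(policy,
                 rollout: Callable[[object], Trajectory],
                 evaluator: Callable[[bytes, str], bool],
                 update: Callable[[object, Trajectory], None],
                 n_iters: int) -> None:
    """Stage 2: collect fresh rollouts, label them with a VLM-based
    success evaluator, and keep updating the policy on its own experience."""
    for _ in range(n_iters):
        traj = rollout(policy)  # one episode in the live Android environment
        success = evaluator(traj[-1].screenshot, traj[0].instruction)
        for step in traj:
            step.reward = 1.0 if success else 0.0  # sparse terminal reward
        update(policy, traj)
```

The key design choice this captures is that the evaluator, rather than hand-written success checks, provides the reward signal, which is what makes fully autonomous online data collection practical.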
The researchers conducted experiments to evaluate DigiRL on challenging Android device-control problems and compared it against existing state-of-the-art agents. The agent trained with DigiRL was tested on a variety of tasks from the Android in the Wild (AitW) dataset using real Android device emulators. The results were significantly improved: the agent achieved a 28.7% absolute improvement over the existing leading agents (raising the success rate from 38.5% to 67.2%) and outperformed the previous best autonomous learning method.
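As a hedged illustration of how success rates over a task set might be measured with parallel emulators, the snippet below runs one episode per task across worker threads and aggregates the outcomes. The `run_episode` callable is an assumption, standing in for whatever drives a single emulator instance and reports task success.

```python
# Minimal sketch of measuring success rate across parallel emulator workers.
# `run_episode(policy, task) -> bool` is a hypothetical helper, not a real API.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def evaluate_success_rate(policy,
                          tasks: List[str],
                          run_episode: Callable[[object, str], bool],
                          n_workers: int = 8) -> float:
    """Return the fraction of tasks the agent completes successfully."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda task: run_episode(policy, task), tasks))
    return sum(results) / len(results)
```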
In conclusion, the researchers proposed DigiRL, a new autonomous RL method for training device-control agents that sets a new benchmark on several Android control tasks from the AitW dataset. A scalable, parallelizable Android learning environment with a robust VLM-based general-purpose evaluator was developed alongside it for fast online data collection. DigiRL thus serves as a strong base algorithm for training device-control agents, with future work planned on expanding the task scope and on further algorithmic research. One limitation is that the method has so far been applied only to tasks from the AitW dataset rather than to all possible device tasks. Despite this, with only 1.3 billion parameters, the agent outperformed advanced models such as GPT-4V and Gemini 1.5 Pro.