Progress toward Artificial General Intelligence (AGI) has been tied to how well agents built on large multimodal models (LMMs) and advanced tools can handle complex scenarios and tasks. One stumbling block is generalization: the observations an agent receives and the actions it must take differ significantly from one setting to another. To address this, researchers have proposed the General Computer Control (GCC) setting, in which an agent takes screen images (and possibly audio) as input and produces keyboard and mouse operations as output, the same interface humans use to interact with computers.
Realizing GCC, however, comes with hurdles: handling multimodal observations, controlling the keyboard and mouse precisely, maintaining long-term memory and reasoning, and exploring and self-improving efficiently.
To tackle these challenges, researchers have proposed the CRADLE framework, which comprises six main modules: information gathering, self-reflection, task inference, skill curation, action planning, and memory. Together, these modules are meant to let an agent understand and interact with digital environments. A test deployment in the AAA game Red Dead Redemption II demonstrates CRADLE’s potential to navigate, learn, and perform in complex virtual worlds without prior knowledge of the game’s mechanics.
CRADLE’s information-gathering module processes screen images to extract relevant textual and visual information, which lets the agent understand the current scenario and plan accordingly. CRADLE can also translate in-game instructions into executable keyboard and mouse actions, enabling nuanced and effective interaction with the game. The reasoning modules refine this interaction by assessing the outcomes of actions and planning future steps based on previous experience and the information collected.
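The way the six modules hand information to one another can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CRADLE's actual implementation: every name here (Memory, gather_information, curate_skills, step, and so on) is hypothetical, the screen is represented as a plain string, and simple string rules stand in for the LMM calls the real framework would make.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# A skill returns low-level input events such as ("key", "e").
# All names in this sketch are hypothetical and chosen for illustration.
IOEvent = Tuple[str, str]
Skill = Callable[[], List[IOEvent]]


@dataclass
class Memory:
    """Memory module: stores what was seen, decided, and done."""
    events: List[dict] = field(default_factory=list)

    def add(self, **event) -> None:
        self.events.append(event)

    def recent(self, n: int = 1) -> List[dict]:
        return self.events[-n:]


def gather_information(screen: str) -> dict:
    """Information gathering: a toy stand-in for an LMM reading a frame."""
    return {"text": screen, "words": screen.lower().split()}


def self_reflect(memory: Memory) -> str:
    """Self-reflection: summarize the outcome of the previous step."""
    last = memory.recent(1)
    return f"last action was {last[0]['action']}" if last else "no previous action"


def infer_task(info: dict, default_task: str) -> str:
    """Task inference: follow an on-screen prompt if there is one."""
    return "follow on-screen instruction" if "press" in info["words"] else default_task


def curate_skills(info: dict, skills: Dict[str, Skill]) -> Optional[str]:
    """Skill curation: turn 'press <key>' prompts into reusable skills."""
    words = info["words"]
    if "press" in words and words.index("press") + 1 < len(words):
        key = words[words.index("press") + 1]
        name = f"press_{key}"
        # Bind key as a default argument so each skill keeps its own key.
        skills.setdefault(name, lambda key=key: [("key", key)])
        return name
    return None


def plan_action(candidate: Optional[str], skills: Dict[str, Skill]) -> str:
    """Action planning: use the curated skill if available, else wait."""
    return candidate if candidate in skills else "wait"


def step(screen: str, memory: Memory, skills: Dict[str, Skill],
         default_task: str = "explore") -> List[IOEvent]:
    """Run one pass through all six modules and return input events."""
    info = gather_information(screen)
    reflection = self_reflect(memory)
    task = infer_task(info, default_task)
    candidate = curate_skills(info, skills)
    action = plan_action(candidate, skills)
    io_events = skills[action]()
    memory.add(observation=info["text"], reflection=reflection,
               task=task, action=action)
    return io_events


if __name__ == "__main__":
    memory = Memory()
    skills: Dict[str, Skill] = {"wait": lambda: []}
    print(step("Press E to mount the horse", memory, skills))
    print(step("Open field", memory, skills))
```

In this sketch, an on-screen prompt becomes a named skill the first time it is seen, and the memory records each step so that the next reflection can refer back to it. A real agent would replace the string rules with model calls and send the returned events to the operating system's keyboard and mouse.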
Tests in Red Dead Redemption II show that CRADLE can complete diverse tasks with minimal reliance on pre-existing knowledge, a significant step towards realizing GCC. However, the researchers also identified limitations, including weak spatial perception, inadequate understanding of icons, and difficulty processing history.
Despite these limitations, CRADLE’s performance shows that LMM-based agents can carry out real missions in complex games, and it offers insights into building more robust and adaptable agents for computer control tasks. The result represents substantial progress in the pursuit of AGI through the GCC setting and offers a glimpse of a future in which digital agents handle a wide range of computer tasks, navigating and performing in the digital world.
Planned future work on CRADLE includes broadening its range of applications, improving its handling of multimodal input, and refining its decision-making processes, all of which could reshape how AGI and digital interaction are approached.