AppWorld: A Controllable Execution Environment and Benchmark for Evaluating Interactive Coding Agents on Complex API-Based Tasks

As technology advances, the scope for automating our daily digital lives keeps expanding, and large language models (LLMs) are increasingly capable of following instructions, writing code, and using tools. Many everyday digital tasks, however, involve complex activities spanning multiple applications and require reasoning and decision-making over intermediate results. A key challenge in this space is the lack of robust, reproducible evaluation of autonomous agents on realistic tasks. Existing tool-usage and interactive code generation benchmarks fall short here: they grade agents against a single linear sequence of API calls, which cannot fairly score complex tasks that admit multiple valid solutions.
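To see why grading against a fixed call sequence breaks down, consider two equally valid solutions to the same toy task. The sketch below is purely illustrative: the mock world, its methods, and the gold trace are hypothetical constructions for this example, not part of any benchmark discussed here.

```python
"""Illustrative-only: trace matching vs. state-based checking when a task
admits multiple valid solutions. All names here are hypothetical."""
from dataclasses import dataclass, field


@dataclass
class MockWorld:
    """Tiny stand-in for an app environment: balances plus an API-call log."""
    balances: dict = field(default_factory=lambda: {"alice": 100.0, "me": 100.0})
    calls: list = field(default_factory=list)

    def lookup(self, name):
        self.calls.append(("lookup", name))
        return name  # in this toy world, account ids equal contact names

    def transfer(self, to, amount):
        self.calls.append(("transfer", to, amount))
        self.balances["me"] -= amount
        self.balances[to] += amount


def solution_a(world):  # resolves the contact first, then transfers
    account = world.lookup("alice")
    world.transfer(to=account, amount=20.0)


def solution_b(world):  # transfers directly, skipping the lookup call
    world.transfer(to="alice", amount=20.0)


# A trace-matching evaluator compares against one "gold" call sequence.
GOLD_TRACE = [("lookup", "alice"), ("transfer", "alice", 20.0)]

for solve in (solution_a, solution_b):
    world = MockWorld()
    solve(world)
    trace_ok = world.calls == GOLD_TRACE          # brittle: only one path passes
    state_ok = world.balances["alice"] == 120.0   # robust: any valid path passes
    print(solve.__name__, "trace:", trace_ok, "state:", state_ok)
```

Running this, `solution_b` completes the task yet fails trace matching, which is exactly the failure mode that motivates evaluating the resulting world state instead of the call sequence.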

To tackle this issue, researchers from Stony Brook University, the Allen Institute for AI, and Saarland University have introduced the AppWorld Engine, an execution environment comprising 9 applications across domains such as email, money transfer, shopping, and local file systems. Built from about 60,000 lines of code, the engine simulates the digital activities of roughly 100 fictitious users through 457 APIs that closely mirror real app functionality. On top of it, the researchers developed the AppWorld Benchmark, a collection of 750 complex tasks for autonomous agents that demand rich, interactive code generation and are graded programmatically through state-based unit tests.
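For a sense of how an agent drives such an environment, here is a minimal sketch of the interact-execute-observe loop. It assumes the appworld Python package exposes an AppWorld class, a load_task_ids helper, a world.execute method, and an apis namespace roughly as described in the project's public documentation; treat every name and signature here as an assumption to verify against the released library.

```python
# Minimal sketch of an agent loop against an AppWorld-style environment.
# Assumes the `appworld` package with AppWorld, load_task_ids, and
# world.execute as in the project's README (verify against the release).
from appworld import AppWorld, load_task_ids

task_id = load_task_ids("train")[0]  # pick one task from the training split

with AppWorld(task_id=task_id, experiment_name="demo") as world:
    print(world.task.instruction)  # natural-language task given to the agent

    # The agent emits Python code; the environment executes it against the
    # simulated apps and returns the output as the next observation.
    observation = world.execute("print(apis.api_docs.show_app_descriptions())")
    print(observation)

    # ... the agent iterates: reason over the observation, emit more code ...

    # Grading is state-based: unit tests inspect the resulting app databases
    # rather than the particular sequence of API calls the agent made.
    print(world.evaluate())
```

Because the agent communicates through executed code rather than one tool call at a time, it can branch, loop, and combine intermediate results, which is what the benchmark's interactive tasks demand.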

The results show that current models struggle with these tasks, underscoring how far LLMs remain from automating complex digital work. The strongest configuration tested, a ReAct agent backed by GPT-4o, achieved a task goal completion score of only 48.8 on the normal test split (Test-N), falling to 30.2 on the challenge split (Test-C). The second-best model, GPT-4 Turbo, trailed significantly, and open-weight models performed worse still. Scenario-level scores, which require completing every task variant within a scenario, were 30-50% lower than task-level scores, indicating that models do not consistently solve all variants of the same scenario.

In conclusion, AppWorld represents a major step toward effective automation of our digital lives: it provides an execution environment for realistically evaluating how well autonomous agents handle complex, interactive API-based tasks. While the initial results underscore the challenges ahead, the system's modularity and extensibility open avenues for future work, including user interface control, coordination among multiple agents, and the study of privacy and safety in digital assistants.
