Language models are widely used in artificial intelligence (AI), but evaluating their true capabilities continues to pose a considerable challenge, particularly on real-world tasks. Standard evaluation methods rely on synthetic benchmarks: simplified, predictable tasks that don't adequately represent the complexity of day-to-day challenges. They often involve AI-generated queries and basic 'dummy' tools, offering an unrealistic measure of an AI's ability to interact with genuine software and services.
Recognizing these shortcomings, a team of researchers from Shanghai Jiao Tong University and Shanghai AI Laboratory has proposed a new benchmark for evaluating large language models (LLMs): the General Tool Agents (GTA) benchmark. Devised to provide a more accurate and thorough assessment of LLM capabilities, the GTA benchmark features human-written queries with implicit tool-use requirements, as well as real deployed tools spanning several categories, including perception, operation, logic, and creativity. The benchmark also incorporates multimodal inputs to closely emulate real-world contexts, providing a robust and comprehensive test of an LLM's ability to plan and carry out complex tasks.
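To make the setup concrete, a GTA-style task can be pictured as a natural-language query paired with multimodal inputs and a reference chain of real tool calls. The sketch below is purely illustrative: the field names, tools, and values are assumptions for exposition, not the benchmark's actual schema.

```python
# Hypothetical GTA-style task record (illustrative only; not the benchmark's real schema).
# The query implies tool use without naming any tool, the input is multimodal (an image),
# and the reference solution spans more than one tool category.
task = {
    "query": "How much would it cost to buy three of the items shown on this receipt?",
    "files": ["receipt.jpg"],  # multimodal input accompanying the query
    "reference_tool_chain": [
        {"tool": "OCR",        "category": "perception", "args": {"image": "receipt.jpg"}},
        {"tool": "Calculator", "category": "logic",      "args": {"expression": "3 * 12.50"}},
    ],
    "reference_answer": "37.50",
}
```

The key point is that the query never names a tool, so the model must infer both which tools to call and in what order.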
Comprising 229 intricate real-world tasks, the GTA benchmark evaluates LLMs along more than one dimension. Its step-by-step and end-to-end modes provide insight into a model's planning, tool selection, action prediction, and overall task execution. The results so far are sobering: current LLMs such as GPT-4 and GPT-4o solved fewer than 50% of the tasks, while most other models scored below 25% accuracy.
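The two modes can be thought of as grading different things: step-by-step evaluation compares each predicted tool call against a reference chain, while end-to-end evaluation only checks whether the final answer matches. The sketch below illustrates that distinction with hypothetical data structures and metrics; it is not the benchmark's actual scoring code.

```python
from typing import Dict, List


def step_by_step_score(predicted: List[Dict], reference: List[Dict]) -> float:
    """Fraction of reference steps where the predicted tool name matches (hypothetical metric)."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred.get("tool") == ref.get("tool")
    )
    return hits / len(reference)


def end_to_end_score(predicted_answer: str, reference_answer: str) -> float:
    """1.0 if the final answer matches the reference exactly, else 0.0 (hypothetical metric)."""
    return float(predicted_answer.strip() == reference_answer.strip())


# Example: the model picks the right tools but miscalculates the final answer,
# so it scores well step-by-step yet fails end-to-end.
predicted_chain = [{"tool": "OCR"}, {"tool": "Calculator"}]
reference_chain = [{"tool": "OCR"}, {"tool": "Calculator"}]
print(step_by_step_score(predicted_chain, reference_chain))  # 1.0
print(end_to_end_score("40.00", "37.50"))                    # 0.0
```

Separating the two scores is what lets the benchmark distinguish models that plan well but execute poorly from models that fail at tool selection altogether.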
Despite this, the researchers believe the shortcomings highlighted by the GTA benchmark aren't insurmountable. Among open-source models, Qwen-72b achieved comparatively strong accuracy, suggesting that LLMs can be improved to better meet real-world requirements. The GTA benchmark thus sets a new, much-needed standard for evaluating LLMs. It could shape future research and development efforts by pinpointing exactly where tool-use proficiency must improve before these systems can truly excel in real-world scenarios, and it ultimately underlines the pressing need for continued advances toward general-purpose tool agents.
It's worth noting that these findings come from the team at Shanghai Jiao Tong University and Shanghai AI Laboratory, and all credit for the research belongs to them.