Researchers at Carnegie Mellon University have developed VisualWebArena, a benchmark designed to assess how well autonomous AI agents handle realistic, visually grounded web tasks. Current benchmarks mainly evaluate text-based agents, whereas VisualWebArena measures an agent’s ability to process both textual and visual inputs, interpret complex natural language instructions, and execute tasks in web environments.
What sets VisualWebArena apart is its emphasis on accurate visual understanding, which agents need in order to operate on realistic web content. The benchmark comprises 910 realistic tasks spread across three web environments: Shopping, Reddit, and Classifieds. The Classifieds environment is newly introduced, and every task is visually grounded, requiring genuine comprehension of image content to solve.
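To make the notion of a visually grounded task concrete, the sketch below shows one way such a task record could be represented in Python. The field names (`task_id`, `env`, `intent`, `image_paths`, `eval_fn`) are illustrative assumptions, not the benchmark’s actual schema.

```python
# A minimal sketch of a visually grounded task record; field names are
# illustrative and do not reflect VisualWebArena's actual task format.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class VisualTask:
    task_id: int
    env: str                      # one of "shopping", "reddit", "classifieds"
    intent: str                   # natural-language instruction
    image_paths: list[str] = field(default_factory=list)  # input images, if any
    # Evaluator that inspects the final agent/page state and returns success.
    eval_fn: Callable[[dict], bool] = lambda state: False

# Hypothetical example: a task whose goal depends on an attached image.
example = VisualTask(
    task_id=0,
    env="classifieds",
    intent="Find the listing that matches the attached photo and comment asking about its price.",
    image_paths=["query_photo.jpg"],
)
```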
The benchmark evaluates autonomous agents built on both Large Language Models (LLMs) and Vision-Language Models (VLMs). Results indicate that VLM-based agents outperform text-only LLM agents on VisualWebArena tasks, although their highest success rate of 16.4% remains far below human performance (88.7%). The study also found a sizable gap between open-source and API-based VLM agents, underscoring the need for comprehensive evaluation of multimodal capabilities.
The researchers also propose a new VLM agent that uses the Set-of-Marks prompting strategy, which overlays numbered marks on interactable page elements so the model can refer to them directly. This streamlines the action space and improves performance on visually complex web pages, suggesting a promising direction for strengthening autonomous agents in such environments.
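To illustrate the idea behind Set-of-Marks prompting, the following Python sketch overlays numbered marks on element bounding boxes and builds a prompt whose actions refer to those mark IDs. The helper names, action syntax, and bounding boxes are assumptions for illustration, not the authors’ implementation.

```python
# A minimal sketch of Set-of-Marks style prompting, assuming bounding boxes for
# interactable elements are already available (e.g., from an accessibility tree).
# draw_marks/build_prompt and the action syntax are illustrative placeholders.
from PIL import Image, ImageDraw

def draw_marks(screenshot: Image.Image, boxes: dict[int, tuple[int, int, int, int]]) -> Image.Image:
    """Overlay a numbered mark on each interactable element's bounding box."""
    annotated = screenshot.copy()
    draw = ImageDraw.Draw(annotated)
    for mark_id, (x1, y1, x2, y2) in boxes.items():
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(mark_id), fill="red")
    return annotated

def build_prompt(objective: str, boxes: dict[int, tuple[int, int, int, int]]) -> str:
    """Ask the VLM to act on elements by mark ID, keeping the action space small."""
    element_list = "\n".join(f"[{i}] element at {box}" for i, box in boxes.items())
    return (
        f"Objective: {objective}\n"
        f"Interactable elements (by mark ID):\n{element_list}\n"
        "Respond with one action, e.g. `click [3]` or `type [7] \"blue bicycle\"`."
    )

# Hypothetical usage: annotate a screenshot, then send the marked image and
# prompt to a VLM, which replies with an action referencing a mark ID.
boxes = {1: (10, 10, 120, 40), 2: (10, 60, 200, 90)}
marked = draw_marks(Image.new("RGB", (640, 480), "white"), boxes)
prompt = build_prompt("Find the cheapest red bicycle in the Classifieds listings.", boxes)
```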
In conclusion, VisualWebArena offers a rigorous way to evaluate multimodal autonomous language agents and can guide the development of more capable agents for web tasks. Future research and development should particularly target interleaved image-text understanding, which is required by about 25.2% of the tasks.
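As a rough illustration of what an interleaved image-text instruction looks like in practice, the sketch below sends a mixed text-and-image query to a vision-language model through the OpenAI chat API. The model name, helper functions, and file paths are placeholders, not the setup used in the paper.

```python
# A hedged sketch of passing an interleaved image-text instruction to a VLM via
# the OpenAI chat format; model name and helpers are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_interleaved(parts: list[str | tuple[str, str]]) -> str:
    """parts mixes plain text strings with ("image", path) tuples, in order."""
    content = []
    for part in parts:
        if isinstance(part, str):
            content.append({"type": "text", "text": part})
        else:
            _, path = part
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"},
            })
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

# Example: an instruction whose meaning depends on the attached image.
# answer = ask_interleaved([
#     "Find a listing that looks like this photo:",
#     ("image", "query_photo.jpg"),
#     "and report its asking price.",
# ])
```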