Natural Language Processing (NLP) aims to enable computers to understand and generate human language, facilitating human-computer interaction. Despite advances in NLP, large language models (LLMs) often fall short on complex planning tasks such as decision-making and organizing sequences of actions, abilities crucial in applications ranging from daily tasks to strategic business decisions. At present, most AI planning relies on predefined formal languages and requires expert knowledge to operate.
Researchers at Google DeepMind recently introduced NATURAL PLAN, a benchmark designed to evaluate the capacity of LLMs to handle complex planning tasks expressed in natural language. Built on real-world data from Google applications such as Flights, Maps, and Calendar, NATURAL PLAN covers trip planning, meeting planning, and calendar scheduling. The tasks were designed to mimic real-world planning challenges, testing each LLM's ability to work under constraints and solve problems that have a single correct solution.
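To make the task format concrete, here is a toy sketch (not taken from the benchmark itself; the cities, flight connections, and day counts are invented for illustration) of the kind of constrained trip-planning problem NATURAL PLAN poses in natural language, solved here by brute-force search:

```python
from itertools import permutations

# Hypothetical NATURAL PLAN-style instance: visit three cities, each for a
# fixed number of days, using only the direct flights listed, in 9 days total.
days_required = {"Paris": 3, "Rome": 4, "Vienna": 2}
direct_flights = {("Paris", "Vienna"), ("Vienna", "Rome")}  # undirected pairs

def connected(a, b):
    """True if a direct flight exists between cities a and b."""
    return (a, b) in direct_flights or (b, a) in direct_flights

def find_itinerary(total_days=9):
    """Return the first city ordering that satisfies every constraint."""
    for order in permutations(days_required):
        if sum(days_required[c] for c in order) != total_days:
            continue  # stay lengths must fill the trip window exactly
        if all(connected(a, b) for a, b in zip(order, order[1:])):
            return order  # every consecutive hop has a direct flight
    return None

print(find_itinerary())  # → ('Paris', 'Vienna', 'Rome')
```

An exhaustive solver like this can verify whether a model's proposed plan matches the single correct solution; the benchmark's difficulty scales quickly, since the search space grows factorially with the number of cities.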
However, state-of-the-art models such as GPT-4 and Gemini 1.5 Pro struggled on NATURAL PLAN. GPT-4 achieved a 31.1% success rate in Trip Planning and 47.0% in Meeting Planning, while Gemini 1.5 Pro achieved 34.8% and 39.1%, respectively. Performance dropped further as task complexity increased, for example when planning trips across ten cities, and the models also struggled to learn from more complicated examples.
Prompting the models to identify and correct their own mistakes often lowered scores further. Despite these setbacks, long-context experiments showed some promise: models like Gemini 1.5 Pro improved gradually as more in-context examples were provided.
This research exposes a clear gap between the planning capabilities of current LLMs and those of humans, while also offering insight into the potential and limitations of these models. The findings highlight the need for more robust handling of complex, practical planning tasks, which will require further research and development. NATURAL PLAN provides an effective benchmark for evaluating and advancing the planning capabilities of LLMs. Although there is ample room for improvement, the work points toward future advances in natural-language planning that could bring greater proficiency to a variety of fields.