Large language models (LLMs) have shown promise in solving planning problems, but their success has been limited, particularly when it comes to translating natural language planning descriptions into structured planning languages such as the Planning Domain Definition Language (PDDL). Even strong models like GPT-4o achieve only about 35% accuracy on simple planning tasks, underscoring the need for more effective methods and for better benchmarks to evaluate the PDDL these models produce.
To address the difficulties of using LLMs for planning tasks, researchers have explored multiple approaches. Using LLMs to generate plans directly has not been very successful, since the models perform poorly even on basic tasks. Another method, “Planner-Augmented LLMs,” combines the strengths of LLMs with classical planning techniques: the LLM treats the problem as a machine translation task, converting a natural language problem description into a structured form such as PDDL, which a classical planner can then solve.
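In practice, such a pipeline has two stages: the LLM produces a PDDL problem file, and an off-the-shelf planner searches for a plan over it. A minimal sketch of this workflow follows; the prompt, the `call_llm` placeholder, and the Fast Downward command line are illustrative assumptions about one possible setup, not details from the paper.

```python
# Sketch of a planner-augmented pipeline (assumed details, not the paper's code):
# 1) an LLM translates a natural language description into a PDDL problem,
# 2) a classical planner (here, Fast Downward) searches for a plan.
import subprocess
from pathlib import Path

PROMPT = (
    "Translate the following planning problem into a PDDL problem file "
    "for the given domain. Respond with PDDL only.\n\n{description}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (e.g., a chat-completion request)."""
    raise NotImplementedError

def translate_to_pddl(description: str) -> str:
    """Stage 1: natural language -> PDDL problem, via the LLM."""
    return call_llm(PROMPT.format(description=description))

def solve(domain_file: str, problem_pddl: str) -> str:
    """Stage 2: hand the generated problem to a classical planner.
    The fast-downward.py invocation is an assumption about the local install."""
    Path("problem.pddl").write_text(problem_pddl)
    subprocess.run(
        ["fast-downward.py", domain_file, "problem.pddl",
         "--search", "astar(lmcut())"],
        check=True,
    )
    return Path("sas_plan").read_text()  # Fast Downward writes the plan here
```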
Despite the promise of this hybrid approach, evaluating the generated PDDL remains a challenge. Current evaluation methods, such as match-based metrics and plan validators, are often inadequate: surface-level matching penalizes harmless differences in how a problem is written, and validating a resulting plan does not confirm that the generated problem actually captures the intent of the original instructions.
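A small illustration of the first failure mode (the goal strings and the helper below are my own example, not drawn from the benchmark): two conjunctive goals that list the same ground atoms in a different order describe the same state, yet an exact-match metric counts them as different.

```python
# Illustrative sketch: why exact string matching is too strict for PDDL goals.
# The two goals below describe the same state, but a match-based metric
# treats them as different.

def atoms(goal: str) -> set[str]:
    """Split a conjunctive PDDL goal like "(and (on b a) (on-table a))"
    into its ground atoms, ignoring order and whitespace."""
    inner = goal.strip()
    assert inner.startswith("(and") and inner.endswith(")")
    inner = inner[len("(and"):-1]
    parts, depth, current = [], 0, ""
    for ch in inner:
        if ch == "(":
            depth += 1
        if depth > 0:
            current += ch
        if ch == ")":
            depth -= 1
            if depth == 0:
                parts.append(" ".join(current.split()))
                current = ""
    return set(parts)

goal_a = "(and (on b a) (on-table a))"
goal_b = "(and (on-table a)   (on b a))"

print(goal_a == goal_b)                # False: exact match misses the equivalence
print(atoms(goal_a) == atoms(goal_b))  # True: same set of ground atoms
```

Comparing sets of atoms is still not enough in general, for instance when two problems name the same objects differently, which is what motivates the graph-based equivalence check described below.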
Recognizing these issues, researchers at Brown University’s Department of Computer Science have introduced Planetarium, a rigorous benchmark for evaluating the ability of LLMs to translate natural language planning problem descriptions into PDDL. Planetarium formally defines planning problem equivalence and offers an algorithm to verify whether two PDDL problems satisfy this definition. Included in the benchmark is a comprehensive dataset of 132,037 PDDL problems and corresponding text descriptions of varying size and complexity.
The evaluation works by translating PDDL code into scene graphs that represent each problem’s initial and goal states. To decide whether two problems are equivalent, the algorithm first runs quick checks that settle obviously equivalent or obviously non-equivalent cases, and moves on to a fuller comparison of the scene graphs only when those checks are inconclusive. For more complex problems, it supports two modes: one where the identity of named objects matters, and one where objects in the goal state are treated as interchangeable placeholders.
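The paper’s algorithm is more involved, but the core idea can be sketched as follows: encode each state as a graph whose nodes are objects and whose labeled edges are ground predicates, then test whether the two graphs are isomorphic under a matching that respects predicate labels. The encoding below, using networkx, is an illustrative simplification rather than the benchmark’s implementation.

```python
# Illustrative simplification of equivalence checking via scene graphs:
# objects become nodes, binary predicates become labeled edges, and two states
# are compared with attribute-aware graph isomorphism (networkx).
import networkx as nx
from networkx.algorithms import isomorphism

def scene_graph(atoms):
    """Build a scene graph from ground atoms like ("on", "b", "a")."""
    g = nx.MultiDiGraph()
    for pred, *args in atoms:
        if len(args) == 1:            # unary predicate -> node attribute
            g.add_node(args[0])
            g.nodes[args[0]].setdefault("preds", set()).add(pred)
        elif len(args) == 2:          # binary predicate -> labeled edge
            g.add_edge(args[0], args[1], pred=pred)
    return g

def equivalent(atoms_a, atoms_b):
    """Equivalence up to renaming of objects (the 'placeholder objects' case)."""
    ga, gb = scene_graph(atoms_a), scene_graph(atoms_b)
    matcher = isomorphism.MultiDiGraphMatcher(
        ga, gb,
        node_match=lambda n1, n2: n1.get("preds", set()) == n2.get("preds", set()),
        edge_match=lambda e1, e2: {d["pred"] for d in e1.values()}
                                  == {d["pred"] for d in e2.values()},
    )
    return matcher.is_isomorphic()

# Two states that differ only in object names are judged equivalent:
print(equivalent([("on", "b", "a"), ("on-table", "a")],
                 [("on", "y", "x"), ("on-table", "x")]))   # True
```

In the mode where object identity is relevant, the node match would additionally require the object names themselves to agree rather than only their attached predicates.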
The Planetarium benchmark was used to evaluate several LLMs on translating natural language into PDDL. GPT-4o, Mistral v0.3 7B Instruct, and Gemma 1.1 IT 2B & 7B all performed poorly in zero-shot settings, while fine-tuning significantly improved performance across all models.
The Planetarium benchmark marks a significant step in evaluating LLM translation capabilities, addressing essential technical and societal concerns. The modest results of even advanced models such as GPT-4o underscore the complexity of these tasks and the need for continued improvement and innovation. As advancements are made, Planetarium offers a critical framework for assessing progress and ensuring reliable results. Overall, this study furthers AI capabilities while underlining the necessity of responsible development in creating trustworthy AI planning systems.