Large language models (LLMs) have shown promise in solving planning problems, but their success has been limited, particularly when it comes to translating natural language planning descriptions into structured planning languages such as the Planning Domain Definition Language (PDDL). Even strong models like GPT-4o achieve only about 35% accuracy on simple planning tasks, underscoring the need for more effective methods and for better benchmarks to evaluate the PDDL these models produce.
To address the difficulties of using LLMs for planning tasks, researchers have explored multiple approaches. Using LLMs to generate plans directly has not been very successful, since the models perform poorly even on basic tasks. Another method, “Planner-Augmented LLMs,” combines the strengths of LLMs with classical planning techniques: the LLM treats the problem as a machine translation task, converting a natural language problem description into a structured form such as PDDL, which a classical planner can then solve.
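In practice, such a pipeline has two stages: the LLM produces a PDDL problem file, and an off-the-shelf planner searches for a plan over it. A minimal sketch of this workflow follows; the prompt, the `call_llm` placeholder, and the Fast Downward command line are illustrative assumptions about one possible setup, not details from the paper.

```python
# Sketch of a planner-augmented pipeline (assumed details, not the paper's code):
# 1) an LLM translates a natural language description into a PDDL problem,
# 2) a classical planner (here, Fast Downward) searches for a plan.
import subprocess
from pathlib import Path

PROMPT = (
    "Translate the following planning problem into a PDDL problem file "
    "for the given domain. Respond with PDDL only.\n\n{description}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (e.g., a chat-completion request)."""
    raise NotImplementedError

def translate_to_pddl(description: str) -> str:
    """Stage 1: natural language -> PDDL problem, via the LLM."""
    return call_llm(PROMPT.format(description=description))

def solve(domain_file: str, problem_pddl: str) -> str:
    """Stage 2: hand the generated problem to a classical planner.
    The fast-downward.py invocation is an assumption about the local install."""
    Path("problem.pddl").write_text(problem_pddl)
    subprocess.run(
        ["fast-downward.py", domain_file, "problem.pddl",
         "--search", "astar(lmcut())"],
        check=True,
    )
    return Path("sas_plan").read_text()  # Fast Downward writes the plan here
```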
Despite the promise of this hybrid approach, evaluating the generated PDDL remains a challenge. Current evaluation methods, such as match-based metrics and plan validators, are often inadequate: surface-level matching penalizes harmless differences in how a problem is written, and validating a resulting plan does not confirm that the generated problem actually captures the intent of the original instructions.
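A small illustration of the first failure mode (the goal strings and the helper below are my own example, not drawn from the benchmark): two conjunctive goals that list the same ground atoms in a different order describe the same state, yet an exact-match metric counts them as different.

```python
# Illustrative sketch: why exact string matching is too strict for PDDL goals.
# The two goals below describe the same state, but a match-based metric
# treats them as different.

def atoms(goal: str) -> set[str]:
    """Split a conjunctive PDDL goal like "(and (on b a) (on-table a))"
    into its ground atoms, ignoring order and whitespace."""
    inner = goal.strip()
    assert inner.startswith("(and") and inner.endswith(")")
    inner = inner[len("(and"):-1]
    parts, depth, current = [], 0, ""
    for ch in inner:
        if ch == "(":
            depth += 1
        if depth > 0:
            current += ch
        if ch == ")":
            depth -= 1
            if depth == 0:
                parts.append(" ".join(current.split()))
                current = ""
    return set(parts)

goal_a = "(and (on b a) (on-table a))"
goal_b = "(and (on-table a)   (on b a))"

print(goal_a == goal_b)                # False: exact match misses the equivalence
print(atoms(goal_a) == atoms(goal_b))  # True: same set of ground atoms
```

Comparing sets of atoms is still not enough in general, for instance when two problems name the same objects differently, which is what motivates the graph-based equivalence check described below.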
Recognizing these issues, researchers at Brown University’s Department of Computer Science have introduced Planetarium, a rigorous benchmark for evaluating the ability of LLMs to translate natural language planning problem descriptions into PDDL. Planetarium formally defines planning problem equivalence and offers an algorithm to verify whether two PDDL problems satisfy this definition. Included in the benchmark is a comprehensive dataset of 132,037 PDDL problems and corresponding text descriptions of varying size and complexity.
The evaluation works by translating PDDL code into scene graphs that represent each problem’s initial and goal states. To decide whether two problems are equivalent, the algorithm first runs quick checks that settle obviously equivalent or obviously non-equivalent cases, and moves on to a fuller comparison of the scene graphs only when those checks are inconclusive. For more complex problems, it supports two modes: one where the identity of named objects matters, and one where objects in the goal state are treated as interchangeable placeholders.
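The paper’s algorithm is more involved, but the core idea can be sketched as follows: encode each state as a graph whose nodes are objects and whose labeled edges are ground predicates, then test whether the two graphs are isomorphic under a matching that respects predicate labels. The encoding below, using networkx, is an illustrative simplification rather than the benchmark’s implementation.

```python
# Illustrative simplification of equivalence checking via scene graphs:
# objects become nodes, binary predicates become labeled edges, and two states
# are compared with attribute-aware graph isomorphism (networkx).
import networkx as nx
from networkx.algorithms import isomorphism

def scene_graph(atoms):
    """Build a scene graph from ground atoms like ("on", "b", "a")."""
    g = nx.MultiDiGraph()
    for pred, *args in atoms:
        if len(args) == 1:            # unary predicate -> node attribute
            g.add_node(args[0])
            g.nodes[args[0]].setdefault("preds", set()).add(pred)
        elif len(args) == 2:          # binary predicate -> labeled edge
            g.add_edge(args[0], args[1], pred=pred)
    return g

def equivalent(atoms_a, atoms_b):
    """Equivalence up to renaming of objects (the 'placeholder objects' case)."""
    ga, gb = scene_graph(atoms_a), scene_graph(atoms_b)
    matcher = isomorphism.MultiDiGraphMatcher(
        ga, gb,
        node_match=lambda n1, n2: n1.get("preds", set()) == n2.get("preds", set()),
        edge_match=lambda e1, e2: {d["pred"] for d in e1.values()}
                                  == {d["pred"] for d in e2.values()},
    )
    return matcher.is_isomorphic()

# Two states that differ only in object names are judged equivalent:
print(equivalent([("on", "b", "a"), ("on-table", "a")],
                 [("on", "y", "x"), ("on-table", "x")]))   # True
```

In the mode where object identity is relevant, the node match would additionally require the object names themselves to agree rather than only their attached predicates.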
The Planetarium benchmark was used to evaluate several LLMs on translating natural language into PDDL. GPT-4o, Mistral v0.3 7B Instruct, and Gemma 1.1 IT 2B & 7B all performed poorly in zero-shot settings, while fine-tuning significantly improved performance across all models.
The Planetarium benchmark marks a significant step in evaluating LLM translation capabilities, addressing essential technical and societal concerns. The modest results of even advanced models such as GPT-4o underscore the complexity of these tasks and the need for continued improvement and innovation. As advancements are made, Planetarium offers a critical framework for assessing progress and ensuring reliable results. Overall, this study furthers AI capabilities while underlining the necessity of responsible development in creating trustworthy AI planning systems.