Researchers from Stony Brook University, the US Naval Academy, and the University of Texas at Austin have developed CAT-BENCH, a benchmark that assesses language models’ ability to predict the order of steps in cooking recipes. The work focuses on how language models comprehend plans, examining whether they understand the causal and temporal dependencies between instruction steps. Evaluations reveal that even advanced models struggle to recognise these dependencies in instructional text, even when methods such as few-shot learning and explanation-based prompting are used.
Planning, a critical aspect of decision-making, relies heavily on the causal relationships between steps. Effectively using, revising, or customising a plan requires an understanding of its steps and their causal connections. While such capabilities are commonly evaluated in simulated environments where plans can be executed, real-world plans pose a distinct challenge because they cannot easily be executed and physically verified for accuracy and reliability.
To examine whether language models can recognise and predict the ordering of steps, the researchers created CAT-BENCH. The benchmark comprises 2,840 questions across 57 recipes, each asking whether one step must occur before or after another, and it evaluates models’ precision, recall, and F1 score when predicting dependencies between steps.
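The paper’s evaluation code is not reproduced here, but the scoring idea is straightforward. The sketch below uses an illustrative question format of our own (the field names `step_a`, `step_b`, and `label` are assumptions) to show how yes/no dependency predictions could be scored with precision, recall, and F1, treating “a dependency exists” as the positive class.

```python
# Minimal sketch (not the authors' released code) of scoring yes/no
# dependency predictions with precision, recall, and F1.
from dataclasses import dataclass

@dataclass
class DependencyQuestion:
    recipe_id: str
    step_a: int   # index of the first step in the recipe
    step_b: int   # index of the second step
    label: bool   # True if step_a must happen before step_b

def score(predictions: list[bool], questions: list[DependencyQuestion]) -> dict:
    """Treat 'a dependency exists' as the positive class."""
    tp = sum(p and q.label for p, q in zip(predictions, questions))
    fp = sum(p and not q.label for p, q in zip(predictions, questions))
    fn = sum((not p) and q.label for p, q in zip(predictions, questions))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```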
Different models were tested on CAT-BENCH, with the highest F1 scores in the zero-shot setting achieved by GPT-4-turbo and GPT-3.5-turbo. Asking models to provide explanations along with their answers generally improved performance, most notably the F1 score of GPT-4o. However, the models showed a bias towards predicting that a dependency exists, which skewed the balance between their precision and recall. The quality of the generated explanations also varied, with larger models generally producing better explanations than smaller ones.
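The exact prompts used in the study are not shown in this article; the sketch below assumes a simple yes/no question format and illustrates the difference between a plain zero-shot prompt and an “explain, then answer” prompt of the kind described above. The function name and wording are hypothetical.

```python
# Illustrative prompt construction for a step-dependency question, in two
# variants: plain zero-shot, and explanation-before-answer. The wording is
# an assumption, not the paper's exact prompt.
def build_prompt(recipe_steps: list[str], i: int, j: int, explain_first: bool) -> str:
    plan = "\n".join(f"Step {n + 1}: {s}" for n, s in enumerate(recipe_steps))
    question = (
        f"In the recipe above, must Step {i + 1} happen before Step {j + 1}? "
        "Answer Yes or No."
    )
    if explain_first:
        question += " First explain your reasoning, then give the final answer."
    return f"{plan}\n\n{question}"

steps = ["Preheat the oven to 180C.", "Mix flour and sugar.", "Bake for 30 minutes."]
print(build_prompt(steps, 0, 2, explain_first=True))
```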
Despite recent improvements in language models, the researchers found that no model reliably determined whether one step in a plan must come before or after another, and inconsistencies were observed in the models’ predictions. Prompting a model to explain its reasoning before giving its answer improved overall performance, but substantial room for improvement remains, particularly in understanding step dependencies. These findings highlight current limitations of large language models for plan-based reasoning, and the researchers hope this work will serve as a stepping stone to further advances in the field.
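As a closing illustration of the consistency issue, the sketch below assumes the same dependency can be posed in two phrasings (Step A before Step B, and Step B after Step A) and flags pairs where a model’s answers disagree. The data layout is an assumption for illustration, not the paper’s evaluation code.

```python
# Sketch of one way to surface inconsistent predictions: compare a model's
# answers to the "before" and "after" phrasings of the same step pair.
def find_inconsistencies(answers: dict[tuple[int, int, str], bool]) -> list[tuple[int, int]]:
    """answers maps (step_a, step_b, direction) -> the model's yes/no answer,
    where direction is 'before' (A before B) or 'after' (B after A).
    Both phrasings describe the same dependency, so they should agree."""
    flagged = []
    for (a, b, direction), ans in answers.items():
        if direction == "before":
            mirrored = answers.get((a, b, "after"))
            if mirrored is not None and mirrored != ans:
                flagged.append((a, b))
    return flagged
```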