
Assessing Language Models' Comprehension of Temporal Relations in Process-Oriented Texts: A CAT-BENCH Evaluation

Researchers from Stony Brook University, the US Naval Academy, and the University of Texas at Austin have developed CAT-BENCH, a benchmark for assessing language models' ability to predict the order of steps in cooking recipes. The work focuses on how language models comprehend plans, examining their understanding of the temporal ordering of instructions. Evaluations reveal that even advanced models struggle to recognise causal and temporal relations in instructional texts, even when techniques such as few-shot learning and explanation-based prompting are applied.

Planning, a critical aspect of decision-making, relies heavily on the causal relationships between steps: using, revising, or customising a plan effectively requires understanding the steps involved and how they depend on one another. While such capabilities are commonly evaluated in simulated environments, real-world plans pose a distinct challenge because they cannot simply be executed and physically tested for accuracy and reliability.

To examine how well language models recognise and predict the ordering and timing of steps, the researchers created CAT-BENCH. The benchmark comprises 2,840 questions across 57 recipes and evaluates models' precision, recall, and F1 score when predicting dependencies between recipe steps.
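To make the evaluation concrete, the sketch below shows how binary step-dependency predictions could be scored with precision, recall, and F1. The question format, field names, and example data are illustrative assumptions, not CAT-BENCH's actual schema.

```python
# Minimal sketch of scoring dependency predictions with precision, recall,
# and F1, in the spirit of the benchmark described above. The question
# structure below is an assumption for illustration only.
from dataclasses import dataclass


@dataclass
class DependencyQuestion:
    recipe_id: str
    step_a: int      # index of one recipe step
    step_b: int      # index of a later recipe step
    depends: bool    # gold label: does step_b depend on step_a?


def score(questions: list[DependencyQuestion], predictions: list[bool]) -> dict:
    """Compute precision, recall, and F1 for binary dependency predictions."""
    tp = sum(1 for q, p in zip(questions, predictions) if p and q.depends)
    fp = sum(1 for q, p in zip(questions, predictions) if p and not q.depends)
    fn = sum(1 for q, p in zip(questions, predictions) if not p and q.depends)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = [
        DependencyQuestion("r1", 1, 3, True),
        DependencyQuestion("r1", 2, 4, False),
        DependencyQuestion("r2", 1, 2, True),
    ]
    preds = [True, True, False]
    print(score(gold, preds))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```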

Several models were tested on CAT-BENCH, with the highest F1 scores achieved by GPT-4-turbo and GPT-3.5-turbo in a zero-shot setting. Prompting models to produce explanations alongside their answers generally improved performance, most notably the F1 score of GPT-4o. However, the models showed a bias towards predicting that dependencies exist, which affected their precision and recall. The quality of the generated explanations also varied, with larger models generally producing better explanations than smaller ones.
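The sketch below illustrates the two prompting styles discussed here: a plain zero-shot yes/no query about a step dependency, and a variant that asks the model to explain before answering. The prompt wording and helper function are assumptions for illustration, not the prompts used in the study; gpt-4-turbo appears only as an example model name.

```python
# Hedged sketch of zero-shot vs. explain-then-answer prompting for a
# step-dependency question, using the OpenAI chat completions API.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment


def ask_dependency(steps: list[str], i: int, j: int, explain_first: bool = False) -> str:
    """Ask whether step i must be performed before step j in the given recipe."""
    recipe = "\n".join(f"{n + 1}. {s}" for n, s in enumerate(steps))
    question = (
        f"Here is a recipe:\n{recipe}\n\n"
        f"Must step {i} be performed before step {j}? "
    )
    if explain_first:
        question += "Explain your reasoning, then answer 'yes' or 'no'."
    else:
        question += "Answer 'yes' or 'no'."
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


# Example usage:
# steps = ["Boil the pasta.", "Drain the pasta.", "Grate the cheese."]
# print(ask_dependency(steps, 1, 2, explain_first=True))
```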

Despite improvements in language models, the researchers found that no system could reliably determine whether one step in a plan must come before or after another, and inconsistencies were observed in the models' predictions. Prompting models to explain their answer before giving a final response improved overall performance, but considerable room for improvement remains, particularly in understanding step dependencies. These findings highlight current limitations of large language models for plan-based reasoning, and the researchers hope this work will serve as a stepping stone towards further advances in the field.
