The article introduces ZebraLogic, a benchmark that assesses the logical reasoning capabilities of large language models (LLMs). Using Logic Grid Puzzles, it measures how well LLMs can deduce a unique assignment of values to a set of features from a given list of clues. This unique-value-assignment task mirrors puzzles commonly found in assessments such as the Law School Admission Test (LSAT).
The article presents an example of a 2×3 Logic Grid Puzzle with two houses and three features: names, car models, and animals. Applying the clues step by step yields a unique solution: for instance, Eric lives in House 1, owns a Ford F150, and keeps horses. The example illustrates the kind of step-by-step deduction that ZebraLogic seeks to measure; a brute-force sketch of this style of puzzle appears below.
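The article does not list the full set of clues for this example, so the following minimal Python sketch uses illustrative clues (and the hypothetical second-house values Arnold, Tesla Model 3, and cats) to show how a 2-house puzzle can be solved by enumerating permutations of each feature and checking every clue.

```python
# A minimal sketch of solving a 2-house, 3-feature grid puzzle by brute force.
# The clues and the second-house values are illustrative, not the article's exact example.
from itertools import permutations

names = ["Arnold", "Eric"]
cars = ["Ford F150", "Tesla Model 3"]
animals = ["horses", "cats"]

def satisfies_clues(name, car, animal):
    # Index i in each tuple describes House i+1.
    in_house_1 = name.index("Eric") == 0                             # Eric lives in House 1
    owns_ford = name.index("Eric") == car.index("Ford F150")         # Eric owns the Ford F150
    keeps_horses = car.index("Ford F150") == animal.index("horses")  # the Ford owner keeps horses
    return in_house_1 and owns_ford and keeps_horses

for name in permutations(names):
    for car in permutations(cars):
        for animal in permutations(animals):
            if satisfies_clues(name, car, animal):
                for h in range(2):
                    print(f"House {h + 1}: {name[h]}, {car[h]}, {animal[h]}")
```

For this toy grid the search space is tiny, but the number of candidate assignments grows factorially with grid size, which is why larger ZebraLogic puzzles demand genuine deduction rather than exhaustive checking.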
ZebraLogic consists of 1,000 programmatically generated puzzles, and the large language models are evaluated in a one-shot setting: each model is shown a single worked example containing reasoning steps and a solution in JSON format, and is then required to output its own reasoning and answer in the same JSON structure. This allows for a standardised evaluation of model ability across puzzles of varying complexity; an illustrative output of this kind is shown below.
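The article does not reproduce the exact JSON schema, so the field names below ("reasoning", "solution", "House 1", and so on) are illustrative placeholders showing how a structured answer of this kind might be produced by a model and parsed by the evaluator.

```python
# Illustrative sketch of a JSON-formatted answer; the exact field names used by
# ZebraLogic may differ from the placeholders shown here.
import json

model_output = """
{
  "reasoning": "Clue 1 places Eric in House 1; clue 2 gives him the Ford F150; clue 3 adds the horses.",
  "solution": {
    "House 1": {"Name": "Eric", "CarModel": "Ford F150", "Animal": "horses"},
    "House 2": {"Name": "Arnold", "CarModel": "Tesla Model 3", "Animal": "cats"}
  }
}
"""

parsed = json.loads(model_output)
print(parsed["solution"]["House 1"]["Name"])  # "Eric"
```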
ZebraLogic uses two primary metrics: cell-wise accuracy and puzzle-level accuracy. For a puzzle of size N×M, cell-wise accuracy is the percentage of the N×M cells filled with the correct value, whereas puzzle-level success requires every cell to be correct; both metrics are sketched below. Puzzle difficulty increases with the size of the grid.
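A minimal sketch of the two metrics, assuming gold and predicted solutions are represented as dictionaries keyed by (house, feature) cells; the benchmark's actual data structures may differ.

```python
# Sketch of the two ZebraLogic metrics over a dictionary-of-cells representation.
def cell_accuracy(predicted, gold):
    """Fraction of the N x M cells filled with the correct value."""
    correct = sum(predicted.get(cell) == value for cell, value in gold.items())
    return correct / len(gold)

def puzzle_solved(predicted, gold):
    """Puzzle-level success: every single cell must be correct."""
    return all(predicted.get(cell) == value for cell, value in gold.items())

gold = {("House 1", "Name"): "Eric", ("House 1", "Animal"): "horses",
        ("House 2", "Name"): "Arnold", ("House 2", "Animal"): "cats"}
pred = dict(gold)
pred[("House 2", "Animal")] = "horses"  # introduce one wrong cell

print(cell_accuracy(pred, gold))  # 0.75
print(puzzle_solved(pred, gold))  # False
```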
According to the study, LLMs generally underperformed on these logical reasoning tasks. Claude 3.5 Sonnet, the best-performing model evaluated, achieved an overall puzzle-level accuracy of only 33.4%, which dropped to 12.4% on the complex puzzles. Deficits in counterfactual thinking, reflective reasoning, structured memorisation, and compositional generalisation were identified as factors contributing to this underperformance. By contrast, human solving time scales with puzzle size, from roughly 15 seconds for a 2×2 puzzle to 10-15 minutes for a 4×4 puzzle.
The article concludes by describing the systematic steps involved in creating these logic grid puzzles, from defining features through to formatting the puzzles for LLM input; a schematic sketch of that recipe follows below. Its closing point is that LLMs still struggle with complex logical reasoning, with even the best model solving only 33.4% of all puzzles and 12.4% of the challenging ones. The article thereby seeks to highlight these challenges and to motivate further research and improvement in the field.
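The article outlines the generation pipeline only at a high level, so the following Python sketch is a simplified, hypothetical version of that recipe: define the features, sample a hidden ground-truth assignment, derive clues from it, and format the prompt. A real generator would use richer clue templates and prune them to a minimal sufficient set.

```python
# A simplified, hypothetical sketch of the puzzle-generation recipe: define features,
# sample a hidden assignment, derive clues from it, and format the prompt.
import random

def generate_puzzle(n_houses=2):
    features = {
        "Name": ["Arnold", "Eric"],
        "CarModel": ["Ford F150", "Tesla Model 3"],
        "Animal": ["horses", "cats"],
    }
    # Hidden ground-truth assignment: shuffle each feature's values across houses.
    solution = {f: random.sample(vals, n_houses) for f, vals in features.items()}
    # Derive simple positional clues from the solution; a real generator would use
    # richer clue types and prune them down to a minimal sufficient set.
    clues = [
        f"{solution['Name'][0]} lives in House 1.",
        f"{solution['Name'][0]} owns the {solution['CarModel'][0]}.",
        f"The person who keeps {solution['Animal'][0]} lives in House 1.",
    ]
    prompt = "Solve the puzzle. Clues:\n" + "\n".join(clues)
    return prompt, solution

prompt, solution = generate_puzzle()
print(prompt)
```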