The article introduces ZebraLogic, a benchmark that assesses the logical reasoning capabilities of large language models (LLMs). Using Logic Grid Puzzles, it measures how well LLMs can deduce a unique assignment of values to a set of features from a given list of clues. This unique-value-assignment task mirrors puzzles commonly found in assessments such as the Law School Admission Test (LSAT).
The article presents an example of a 2×3 Logic Grid Puzzle with two houses and three features: names, car models, and animals. Applying the clues step by step yields a unique solution: for instance, Eric lives in House 1, owns a Ford F150, and keeps horses. The example illustrates the kind of step-by-step deduction that ZebraLogic seeks to measure; a brute-force sketch of this style of puzzle appears below.
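The article does not list the full set of clues for this example, so the following minimal Python sketch uses illustrative clues (and the hypothetical second-house values Arnold, Tesla Model 3, and cats) to show how a 2-house puzzle can be solved by enumerating permutations of each feature and checking every clue.

```python
# A minimal sketch of solving a 2-house, 3-feature grid puzzle by brute force.
# The clues and the second-house values are illustrative, not the article's exact example.
from itertools import permutations

names = ["Arnold", "Eric"]
cars = ["Ford F150", "Tesla Model 3"]
animals = ["horses", "cats"]

def satisfies_clues(name, car, animal):
    # Index i in each tuple describes House i+1.
    in_house_1 = name.index("Eric") == 0                             # Eric lives in House 1
    owns_ford = name.index("Eric") == car.index("Ford F150")         # Eric owns the Ford F150
    keeps_horses = car.index("Ford F150") == animal.index("horses")  # the Ford owner keeps horses
    return in_house_1 and owns_ford and keeps_horses

for name in permutations(names):
    for car in permutations(cars):
        for animal in permutations(animals):
            if satisfies_clues(name, car, animal):
                for h in range(2):
                    print(f"House {h + 1}: {name[h]}, {car[h]}, {animal[h]}")
```

For this toy grid the search space is tiny, but the number of candidate assignments grows factorially with grid size, which is why larger ZebraLogic puzzles demand genuine deduction rather than exhaustive checking.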
ZebraLogic consists of 1,000 programmatically generated puzzles, and the large language models are evaluated in a one-shot setting: each model is shown a single worked example containing reasoning steps and a solution in JSON format, and is then required to output its own reasoning and answer in the same JSON structure. This allows for a standardised evaluation of model ability across puzzles of varying complexity; an illustrative output of this kind is shown below.
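The article does not reproduce the exact JSON schema, so the field names below ("reasoning", "solution", "House 1", and so on) are illustrative placeholders showing how a structured answer of this kind might be produced by a model and parsed by the evaluator.

```python
# Illustrative sketch of a JSON-formatted answer; the exact field names used by
# ZebraLogic may differ from the placeholders shown here.
import json

model_output = """
{
  "reasoning": "Clue 1 places Eric in House 1; clue 2 gives him the Ford F150; clue 3 adds the horses.",
  "solution": {
    "House 1": {"Name": "Eric", "CarModel": "Ford F150", "Animal": "horses"},
    "House 2": {"Name": "Arnold", "CarModel": "Tesla Model 3", "Animal": "cats"}
  }
}
"""

parsed = json.loads(model_output)
print(parsed["solution"]["House 1"]["Name"])  # "Eric"
```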
ZebraLogic uses two primary metrics: cell-wise accuracy and puzzle-level accuracy. For a puzzle of size N×M, cell-wise accuracy is the percentage of the N×M cells filled with the correct value, whereas puzzle-level success requires every cell to be correct; both metrics are sketched below. Puzzle difficulty increases with the size of the grid.
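A minimal sketch of the two metrics, assuming gold and predicted solutions are represented as dictionaries keyed by (house, feature) cells; the benchmark's actual data structures may differ.

```python
# Sketch of the two ZebraLogic metrics over a dictionary-of-cells representation.
def cell_accuracy(predicted, gold):
    """Fraction of the N x M cells filled with the correct value."""
    correct = sum(predicted.get(cell) == value for cell, value in gold.items())
    return correct / len(gold)

def puzzle_solved(predicted, gold):
    """Puzzle-level success: every single cell must be correct."""
    return all(predicted.get(cell) == value for cell, value in gold.items())

gold = {("House 1", "Name"): "Eric", ("House 1", "Animal"): "horses",
        ("House 2", "Name"): "Arnold", ("House 2", "Animal"): "cats"}
pred = dict(gold)
pred[("House 2", "Animal")] = "horses"  # introduce one wrong cell

print(cell_accuracy(pred, gold))  # 0.75
print(puzzle_solved(pred, gold))  # False
```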
According to the study, LLMs generally underperformed on these logical reasoning tasks. Claude 3.5 Sonnet, the best-performing model evaluated, achieved an overall puzzle-level accuracy of only 33.4%, which dropped to 12.4% on the complex puzzles. Deficits in counterfactual thinking, reflective reasoning, structured memorisation, and compositional generalisation were identified as factors contributing to this underperformance. By contrast, human solving time scales with puzzle size, from roughly 15 seconds for a 2×2 puzzle to 10-15 minutes for a 4×4 puzzle.
The article concludes by describing the systematic steps involved in creating these logic grid puzzles, from defining features through to formatting the puzzles for LLM input; a schematic sketch of that recipe follows below. Its closing point is that LLMs still struggle with complex logical reasoning, with even the best model solving only 33.4% of all puzzles and 12.4% of the challenging ones. The article thereby seeks to highlight these challenges and to motivate further research and improvement in the field.
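The article outlines the generation pipeline only at a high level, so the following Python sketch is a simplified, hypothetical version of that recipe: define the features, sample a hidden ground-truth assignment, derive clues from it, and format the prompt. A real generator would use richer clue templates and prune them to a minimal sufficient set.

```python
# A simplified, hypothetical sketch of the puzzle-generation recipe: define features,
# sample a hidden assignment, derive clues from it, and format the prompt.
import random

def generate_puzzle(n_houses=2):
    features = {
        "Name": ["Arnold", "Eric"],
        "CarModel": ["Ford F150", "Tesla Model 3"],
        "Animal": ["horses", "cats"],
    }
    # Hidden ground-truth assignment: shuffle each feature's values across houses.
    solution = {f: random.sample(vals, n_houses) for f, vals in features.items()}
    # Derive simple positional clues from the solution; a real generator would use
    # richer clue types and prune them down to a minimal sufficient set.
    clues = [
        f"{solution['Name'][0]} lives in House 1.",
        f"{solution['Name'][0]} owns the {solution['CarModel'][0]}.",
        f"The person who keeps {solution['Animal'][0]} lives in House 1.",
    ]
    prompt = "Solve the puzzle. Clues:\n" + "\n".join(clues)
    return prompt, solution

prompt, solution = generate_puzzle()
print(prompt)
```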