Researchers at the University of Texas at Austin (UT Austin) have introduced a new benchmark designed to evaluate how effectively artificial intelligence can solve complex mathematical problems. PUTNAMBENCH addresses a key issue facing the field: current benchmarks are not sufficiently rigorous and focus mainly on high-school-level mathematics.
Automating mathematical reasoning has been a long-pursued goal in artificial intelligence, with formal proof assistants such as Lean 4, Isabelle, and Coq playing a critical role. These systems allow mathematical theorems to be stated and proved in a machine-verifiable form and provide a structured environment for working through intricate problems. Neural theorem provers seek to automate this process, but their development requires well-defined benchmarks to measure how well they perform.
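To make the idea of machine verification concrete, here is a minimal sketch of a theorem in Lean 4 (using only its core library; this example is illustrative and not drawn from the benchmark). Once the proof term type-checks, Lean's kernel has mechanically verified the statement:

```lean
-- A minimal machine-verifiable theorem: commutativity of natural-number
-- addition, proved by appealing to the core lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```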
PUTNAMBENCH was created to fill this evaluation gap. It comprises problems from the William Lowell Putnam Mathematical Competition, a prestigious North American contest known for its challenging college-level mathematics problems, and contains a total of 1,697 formalizations of 640 problems. Every problem is available in Lean 4 and Isabelle, and a substantial number are also available in Coq.
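A formalization in this setting typically states the problem formally and leaves the proof open for an automated prover to supply. The following is a hypothetical sketch of what such an entry might look like in Lean 4; the statement and name are invented for illustration, not taken from PUTNAMBENCH:

```lean
-- Hypothetical, simplified sketch of a benchmark-style entry: the statement
-- is fully formalized, and the proof is left as `sorry` (a placeholder)
-- for a theorem prover to fill in.
theorem putnam_style_example (n : Nat) : n ∣ n * n := by
  sorry
```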
PUTNAMBENCH is designed to evaluate neural theorem provers across a range of mathematical domains, and it is considered a challenging benchmark because its problems demand genuine multi-step problem solving. The researchers used it to test several neural and symbolic theorem provers, including Draft-Sketch-Prove, COPRA, GPT-4, Sledgehammer, and CoqHammer. Each method takes a different approach, but the results indicated that current techniques can solve only a small number of the PUTNAMBENCH problems.
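Continuing the hypothetical sketch above, an "attempt" by any of these systems amounts to replacing the `sorry` placeholder with a candidate proof, which the proof assistant's kernel then accepts or rejects mechanically:

```lean
-- A candidate proof for the sketch above. Divisibility n ∣ m unfolds to
-- ∃ c, m = n * c, so supplying the witness c = n closes the goal by rfl.
-- If this term type-checks, the proof is machine-verified.
theorem putnam_style_example_solved (n : Nat) : n ∣ n * n :=
  ⟨n, rfl⟩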
A key challenge identified by the researchers is the difficulty of synthesizing new lemmas and composing them into complex proofs, underlining the need for more advanced AI models that can draw effectively on deep mathematical knowledge. The multilingual nature of PUTNAMBENCH further distinguishes it from other benchmarks: because each problem is formalized in multiple proof assistants, the benchmark can evaluate theorem provers comprehensively and across different formal environments.
Although PUTNAMBENCH has already proven valuable, the early results indicate that neural theorem provers capable of solving complex mathematical problems remain a long way off. Even so, PUTNAMBENCH represents an important step toward that goal.