Researchers from the Language Technologies Institute at Carnegie Mellon University and the Institute for Interdisciplinary Information Sciences at Tsinghua University have developed Lean-STaR, a framework that bridges informal human reasoning with formal proof generation to improve machine-driven theorem proving. The work taps a resource that traditional theorem-proving methods have largely overlooked: the natural-language thinking that precedes each formal proof step.
Traditional techniques train solely on formal proof data, so existing language models that generate tactics in formal mathematics never see the informal reasoning that guides human provers and therefore cannot benefit from thought augmentation.
Lean-STaR takes a different approach: the model writes an informal thought before each formal proof step, combining the strengths of formal and informal mathematics. To build the training data, ground-truth tactics from existing proofs are retrospectively annotated with the informal reasoning that would motivate them, and the model is then improved through expert iteration on this thought-augmented dataset.
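To make the retrospective annotation step concrete, here is a minimal sketch of how such a pipeline might look. The prompt wording and the helper names (annotate_with_thought, build_thought_augmented_dataset) are illustrative assumptions rather than the authors' released code; the sketch assumes the OpenAI Python client is installed and an API key is configured.

```python
# Sketch: retrospectively annotate (proof state, ground-truth tactic) pairs
# with informal thoughts, producing a thought-augmented training set.
# Helper names and prompt wording are hypothetical.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Given the Lean proof state below and the ground-truth tactic that was "
    "applied next, write the informal mathematical thought that would lead "
    "a mathematician to choose this tactic.\n\n"
    "Proof state:\n{state}\n\nNext tactic:\n{tactic}\n\nThought:"
)

def annotate_with_thought(state: str, tactic: str) -> str:
    """Ask GPT-4 for a retrospective rationale for a known-good tactic."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT.format(state=state, tactic=tactic)}],
    )
    return response.choices[0].message.content

def build_thought_augmented_dataset(proof_steps):
    """Turn (state, tactic) pairs from existing proofs into training
    examples where the thought is emitted *before* the tactic."""
    dataset = []
    for state, tactic in proof_steps:
        thought = annotate_with_thought(state, tactic)
        dataset.append({"input": state, "target": f"{thought}\n{tactic}"})
    return dataset
```

Fine-tuning on such examples teaches the model to produce the informal thought first and then the formal tactic conditioned on it, which is the core idea of the thought-augmented setup.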
Unlike the Draft-Sketch-Prove method, which showed promise in Isabelle but struggles in Lean because Lean lacks comparably powerful automated proof tools, Lean-STaR advances automated theorem proving directly within Lean's environment. This is particularly beneficial for progress in both mathematics and artificial intelligence.
The experiments used the LeanDojo Benchmark 4 dataset, with proof search constrained by a maximum tactic count and a time limit. Proofs found during sampling were verified and saved for training, yielding over 32,000 successful proofs. Fine-tuning combined the GPT-4-annotated reasoning data with the proofs collected through expert iteration. The resulting Lean-STaR model showed clear qualitative and quantitative gains, raising the pass rate on the miniF2F-test benchmark from 43.4% to 46.3%.
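The expert-iteration loop described above can be sketched as follows. The helpers sample_proof, verify_with_lean, and fine_tune are hypothetical stand-ins for the model's proof sampler, the Lean/LeanDojo proof checker, and the training step, and the budgets shown are placeholders rather than the paper's exact settings.

```python
# Sketch of an expert-iteration loop: sample proofs under a tactic-count and
# time budget, keep the ones Lean verifies, and fine-tune on the successes.
# All helper functions are hypothetical stand-ins.

def expert_iteration(model, theorems, rounds=2, samples_per_theorem=32,
                     max_tactics=50, timeout_s=600):
    dataset = []
    for _ in range(rounds):
        for theorem in theorems:
            for _ in range(samples_per_theorem):
                # Each sampled proof interleaves an informal thought with
                # each formal tactic, as the model was trained to do.
                proof = sample_proof(model, theorem,
                                     max_tactics=max_tactics,
                                     timeout_s=timeout_s)
                if proof is not None and verify_with_lean(theorem, proof):
                    dataset.append((theorem, proof))
        # Fine-tune on all verified proofs collected so far (the paper
        # reports over 32,000 such proofs) before the next sampling round.
        model = fine_tune(model, dataset)
    return model
```

Because every proof in the collected dataset has been checked by Lean, each round trains only on verified successes, which is what lets the model improve without any new human-written labels.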
Despite its promise, the researchers note several limitations: the computational cost of scaling, potential biases inherited from the GPT-4-generated data, and the speed of the Lean interactive theorem prover, which bottlenecks proof search. Even so, the study found that Lean-STaR substantially improved language models' theorem-proving ability and set new state-of-the-art results.
The framework opens new opportunities to advance mathematical understanding and to improve the efficiency and accuracy of automated theorem proving, underscoring the value of integrating informal thoughts into formal proof generation.