Reasoning about code execution is a crucial skill for developers, yet existing large language models often struggle with it. A team from Google DeepMind, Yale University, and the University of Illinois has proposed an approach to improve how these models reason about code execution. The method, called “Naturalized Execution Tuning” (NExT), teaches large language models to interpret and utilise execution traces, which are detailed records of a program’s runtime behaviour.
Unlike conventional approaches, NExT incorporates execution traces, snapshots of a program’s state at particular points during its run, directly into model training. This grounds the model’s understanding in the semantics of the code, that is, in what it actually does at runtime, and supports a more nuanced reasoning process. Because the traces are embedded as inline comments in the code, the model can take into account runtime context that other models often overlook.
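To make the idea of traces embedded as inline comments concrete, here is a minimal sketch using Python's standard `sys.settrace` hook. It is not the format used by NExT itself: the function name `annotate_with_trace`, the `name = value` comment layout, and the example program are all assumptions made for illustration. The sketch also assumes a single function with no nested calls, and for lines inside a loop it keeps only the state from the last visit.

```python
import sys

def annotate_with_trace(source, func_name, *args):
    """Run func_name from source and append, to each executed line,
    the local variables observed right after that line last ran.
    Hypothetical helper for illustration only."""
    namespace = {}
    exec(compile(source, "<snippet>", "exec"), namespace)

    states = {}             # line number -> snapshot of locals after the line
    prev = {"line": None}   # line that was executing before the current event

    def tracer(frame, event, arg):
        if frame.f_code.co_filename != "<snippet>":
            return None     # ignore frames that do not belong to the snippet
        # A "line" event fires before a line runs, so the locals seen now
        # are the result of the previously executed line.
        if event in ("line", "return") and prev["line"] is not None:
            states[prev["line"]] = dict(frame.f_locals)
        if event == "line":
            prev["line"] = frame.f_lineno
        return tracer

    sys.settrace(tracer)
    try:
        namespace[func_name](*args)
    finally:
        sys.settrace(None)

    annotated = []
    for number, text in enumerate(source.splitlines(), start=1):
        if number in states:
            shown = "; ".join(f"{k} = {v!r}" for k, v in states[number].items())
            text = f"{text}  # {shown}"
        annotated.append(text)
    return "\n".join(annotated)

# Example program (an assumption, not taken from the paper's datasets).
buggy = """def total(xs):
    s = 0
    for x in xs:
        s += x
    return s
"""

print(annotate_with_trace(buggy, "total", [1, 2]))
```

The annotated text, rather than the bare source, is what a model would be shown, so variable values sit next to the statements that produced them.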
NExT also stands out for its iterative self-training loop, which progressively refines the model’s ability to generate execution-aware rationales. The method starts from a dataset that pairs proposed code fixes with synthesized execution traces detailing variable states and how they change during execution. The researchers applied NExT to Google’s PaLM 2 model and evaluated it on programming tasks such as code repair. Over repeated iterations, the model’s accuracy steadily improves.
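The shape of such a self-training loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes that candidate fixes are kept only when they pass the task's tests, and the names `passes_tests`, `collect_verified_samples`, `self_train`, `stub_sampler`, and `stub_finetune` are hypothetical. The sampler and fine-tuning step are stubs standing in for a real model.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, str]  # (rationale, proposed fix)

def passes_tests(code: str, tests: str) -> bool:
    """Return True if the tests run against the candidate code without error."""
    env: Dict[str, object] = {}
    try:
        exec(code, env)    # NOTE: a real pipeline would run this in a sandbox
        exec(tests, env)
        return True
    except Exception:
        return False

def collect_verified_samples(tasks: List[dict], sample_fn: Callable, k: int) -> List[dict]:
    """Sample k (rationale, fix) pairs per task and keep those whose fix passes."""
    kept = []
    for task in tasks:
        for rationale, fix in sample_fn(task, k):
            if passes_tests(fix, task["tests"]):
                kept.append({"prompt": task["prompt"],
                             "rationale": rationale,
                             "fix": fix})
    return kept

def self_train(tasks, sample_fn, finetune_fn, rounds=2, k=4):
    """Alternate between sampling verified examples and fine-tuning on them."""
    for _ in range(rounds):
        kept = collect_verified_samples(tasks, sample_fn, k)
        sample_fn = finetune_fn(sample_fn, kept)
    return sample_fn

# Toy task: the prompt is buggy code with an inline trace comment.
task = {
    "prompt": "def add(a, b):\n    return a - b  # a = 1; b = 2",
    "tests": "assert add(1, 2) == 3",
}

canned: List[Candidate] = [
    ("The subtraction looks correct.", "def add(a, b):\n    return a - b"),
    ("The trace shows a - b, but the test expects a sum.",
     "def add(a, b):\n    return a + b"),
]

def stub_sampler(task, k):
    # Stand-in for sampling from a language model.
    return canned[:k]

def stub_finetune(sampler, data):
    # Stand-in for a fine-tuning step; a real one would return an updated model.
    print(f"fine-tuning on {len(data)} verified example(s)")
    return sampler

self_train([task], stub_sampler, stub_finetune, rounds=2, k=2)
```

Filtering on test outcomes means only rationales that led to a working fix enter the next round's training data, which is one plausible way for each iteration to improve on the previous one.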
In the reported evaluation, applying NExT to the PaLM 2 model raised the fix rate by 26.1% on the Mbpp-R dataset and by 14.3% on HumanEval Fix-Plus. The quality of the generated rationales, which are crucial for explaining code fixes, also improved noticeably according to both automated metrics and human evaluations.
In conclusion, NExT significantly improves models’ ability to understand and repair code. By integrating execution traces into training, it substantially increases both fix rates and rationale quality on complex programming tasks. These gains in the accuracy and reliability of automated program repair point to its potential to revolutionise software development practices.