We are excited to announce the launch of Microsoft researchers’ new approach to generating diverse, high-quality instruction data from open-source code, thereby improving the effectiveness of instruction tuning and the generalization ability of fine-tuned models. The method classifies instruction data into four universal code-related tasks and introduces a Large Language Model (LLM) based Generator-Discriminator data processing framework called CodeOcean.
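To make the task taxonomy concrete: in the paper, the four universal code-related tasks are code generation, code summarization, code translation, and code repair. The sketch below shows one possible record layout for task-labeled instruction data; the field names are illustrative assumptions for exposition, not CodeOcean’s published schema.

```python
# Illustrative sketch of a task-labeled instruction record.
# Field names are assumptions, not CodeOcean's actual schema.
from dataclasses import dataclass

@dataclass
class InstructionRecord:
    task: str          # one of: "code generation", "code summarization",
                       # "code translation", "code repair"
    instruction: str   # natural-language instruction derived from raw code
    input: str         # optional context, e.g. the source snippet
    output: str        # the expected model response

example = InstructionRecord(
    task="code repair",
    instruction="Fix the off-by-one error in the following function.",
    input="def last(xs):\n    return xs[len(xs)]",
    output="def last(xs):\n    return xs[len(xs) - 1]",
)
```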
The goal is to enhance the performance of Code LLMs through instruction tuning. The study also introduces WaveCoder, a fine-tuned Code LLM with Widespread And Versatile Enhanced instruction tuning. WaveCoder is designed to make the most of instruction tuning for Code LLMs and exhibits superior generalization ability across code-related tasks compared to other open-source models at the same fine-tuning scale.
The research builds on the concept of alignment: pre-trained models, having learned from self-supervised tasks, can already comprehend text inputs. Instruction tuning then provides instruction-level tasks, allowing pre-trained models to extract more information from instructions and enhancing their ability to interact with users.
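As a rough illustration of what instruction-level tasks look like in practice, the snippet below shows a common way of serializing an (instruction, input, output) triple into a single training prompt for supervised fine-tuning. The template is a generic convention used for illustration, not the exact format used to train WaveCoder.

```python
# A generic prompt template for instruction tuning (illustrative only, not
# WaveCoder's exact format). During fine-tuning, only the response tokens are
# typically used as supervision targets; the prompt portion is masked from the loss.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "Write a response that completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_training_text(instruction: str, input_text: str, output: str) -> str:
    """Concatenate the prompt and target response into one training example."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction, input=input_text)
    return prompt + output
```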
The proposed LLM-based Generator-Discriminator framework leverages source code and explicitly controls data quality during generation. Taking raw code as input, it selects a core dataset and controls data diversity by adjusting the distribution of the raw code, which yields more realistic instruction data.
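The following is a minimal sketch of how an LLM-based generator-discriminator loop over raw code could be wired up. The function names, prompts, and the accept/reject criterion are assumptions made for illustration; the paper’s actual prompts and filtering rules are not reproduced here.

```python
# Minimal sketch of an LLM-based Generator-Discriminator loop over raw code.
# `call_llm` stands in for any chat-completion API; prompts and the filtering
# criterion are illustrative assumptions, not CodeOcean's exact ones.
from typing import Callable, Iterable

def generate_and_filter(
    raw_code_snippets: Iterable[str],
    call_llm: Callable[[str], str],
) -> list[dict]:
    accepted = []
    for code in raw_code_snippets:
        # Generator: turn a raw code snippet into an instruction/answer pair.
        gen_prompt = (
            "Given the following code, write one instruction a user might ask "
            "and the ideal answer.\n\nCode:\n" + code
        )
        candidate = call_llm(gen_prompt)

        # Discriminator: judge whether the generated pair is faithful to the
        # code and well-formed; keep only the examples it accepts.
        disc_prompt = (
            "Does the following instruction/answer pair faithfully match the "
            "code and contain no errors? Reply YES or NO.\n\n"
            f"Code:\n{code}\n\nPair:\n{candidate}"
        )
        verdict = call_llm(disc_prompt)
        if verdict.strip().upper().startswith("YES"):
            accepted.append({"code": code, "pair": candidate})
    return accepted
```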
WaveCoder models are evaluated across code generation, code repair, and code summarization tasks, showcasing their effectiveness in diverse scenarios. A comparison with the CodeAlpaca dataset highlights CodeOcean’s superiority in refining instruction data and enhancing the instruction-following ability of base models. WaveCoder models consistently outperform other open-source models on various benchmarks, including HumanEval, MBPP, and HumanEvalPack.
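For background on how benchmarks such as HumanEval and MBPP are typically scored, the snippet below implements the standard unbiased pass@k estimator popularized by the Codex evaluation methodology. It is included as general context, not as the exact evaluation harness used in this work.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated, c of them correct.

    Returns the estimated probability that at least one of k samples drawn
    without replacement passes the unit tests.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 20 samples per problem, 5 correct -> estimated pass@1 of 0.25
print(pass_at_k(n=20, c=5, k=1))
```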
The research emphasizes the importance of data quality and diversity in the instruction-tuning process. The findings of this study present great opportunities for a range of tasks that require instruction-level understanding, such as natural language processing, code generation, and code repair, and open up new possibilities in LLM-based instruction understanding and instruction-level tasks.
We are thrilled to witness the powerful potential of instruction tuning in improving model capabilities for a range of tasks with the introduction of CodeOcean and WaveCoder. We are excited to see what the future has in store for this technology!