Machine Translation (MT), a subfield of Natural Language Processing (NLP), aims to automate the translation of text from one language to another, increasingly by leveraging large language models (LLMs). The goal is to improve translation accuracy for better global communication and information exchange. A key challenge in improving MT is obtaining high-quality, diverse training data for instruction fine-tuning, which ensures the models can generalize across different contexts and languages. Existing methods for enhancing MT performance include in-context translation exemplar selection, prompt optimization, and improved decoding strategies. Among these, models such as GPT-4, Bayling-13B, BigTranslate-13B, TIM, and NLLB-54B stand out for their focus on instruction tuning and for improving translation performance by leveraging comprehensive datasets and sophisticated evaluation metrics.
Researchers from ByteDance Research have introduced a novel method called G-DIG to select high-quality, diverse instruction data for machine translation. It uses gradient-based techniques to investigate the impact of individual training examples on model performance, aiming to improve data selection without relying on external models. This method enhances the quality and diversity of the training datasets.
G-DIG involves two main steps: selecting high-quality data and enhancing diversity. The researchers first manually create a small seed set of high-quality examples; influence functions are then used to identify training examples that positively affect the model's performance on this seed set. The quality of each candidate training sample is measured by its influence score on the seed (test) instances. To bolster diversity, the researchers apply clustering to the gradients of the training examples, ensuring the selected data exerts a variety of influences on the model. They use Euclidean distance to measure gradient similarity and employ the K-means algorithm to group the training data into diverse patterns.
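The two steps above can be sketched in NumPy. This is a minimal illustration, not ByteDance's implementation: the influence score here is a first-order approximation (the gradient's dot product with the average seed-set gradient, omitting the inverse-Hessian term of the full influence function), the K-means routine is a plain Lloyd's-algorithm version, and all function names are hypothetical.

```python
import numpy as np

def influence_scores(train_grads, test_grads):
    """Approximate each training example's influence on a held-out seed set.

    Hedged first-order sketch: a training gradient that aligns with the
    average seed-set gradient is assumed to help the model on those seeds.
    The full influence function also involves an inverse-Hessian term,
    which is omitted here for simplicity.
    """
    avg_test_grad = test_grads.mean(axis=0)
    return train_grads @ avg_test_grad

def kmeans_labels(X, k, iters=50, seed=0):
    """Plain Lloyd's K-means over gradient vectors, using Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Euclidean distance from every gradient to every cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels

def select_data(train_grads, test_grads, k_clusters, per_cluster):
    """Keep the highest-influence examples within each gradient cluster,
    so the selected subset is both high quality and diverse."""
    scores = influence_scores(train_grads, test_grads)
    labels = kmeans_labels(train_grads, k_clusters)
    selected = []
    for j in range(k_clusters):
        idx = np.where(labels == j)[0]
        # Rank cluster members by influence score, descending.
        top = idx[np.argsort(scores[idx])[::-1][:per_cluster]]
        selected.extend(top.tolist())
    return sorted(selected)
```

In practice the gradients would come from backpropagating each example through the LLM (often compressed or projected to a manageable dimension); the toy vectors here stand in for those per-example gradients.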
Experiments on various translation tasks have shown that G-DIG outperforms existing data selection methods and competes with state-of-the-art models. For instance, in Chinese-to-English (Zh → En) and German-to-English (De → En) translation tasks, G-DIG surpassed baseline selection methods in translation quality. The researchers note that models trained on G-DIG-selected data exhibited better translation quality and closer alignment with human expectations.
In conclusion, ByteDance Research's G-DIG method addresses the challenges of data quality and diversity in machine translation. It leverages gradient-based data selection to enhance model performance without needing external quality-assessment models. The success of this method demonstrates its potential to improve translation accuracy and efficiency, paving the way for more advanced and reliable machine translation systems. By selecting the training data that most directly improves model performance, G-DIG yields models that are better aligned with human instructions and more effective in real-world applications.