Synthetic data generation is becoming an essential part of machine learning because it lets researchers build large datasets where real-world data is scarce or expensive to obtain. Synthetic data often displays specific characteristics that can help models learn and improve performance across applications. However, its use also raises concerns about biases and other attributes that could influence a model’s behavior, so it is crucial to understand how these characteristics affect large language models (LLMs).
Current data-centric methods include data augmentation, pseudo-labeling, data weighting, data pruning, and curriculum learning. These methods help machine learning models by modifying the training data, but their ability to enhance specific characteristics is limited by the properties inherited from the original datasets. To address this limitation, researchers from Cohere for AI and Cohere have proposed a concept called “active inheritance.”
Active inheritance steers synthetic data generation towards non-differentiable objectives such as high lexical diversity and low toxicity. The method relies on targeted sampling: select proxy labels that capture the desired characteristic, generate multiple samples for each prompt, and keep the sample that best satisfies the desired attribute. Models can then be fine-tuned on the resulting synthetic dataset, which has been curated to enhance those characteristics.
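The targeted sampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the function names (`type_token_ratio`, `select_best`, `build_dataset`) are hypothetical, the candidate completions are hard-coded where a real pipeline would sample them from a model, and a simple type-token ratio stands in as the proxy for lexical diversity.

```python
from typing import Callable, Dict, List


def type_token_ratio(text: str) -> float:
    """Assumed proxy for lexical diversity: unique tokens / total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def select_best(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest proxy score (first wins on ties)."""
    return max(candidates, key=score)


def build_dataset(
    samples_per_prompt: Dict[str, List[str]],
    score: Callable[[str], float],
) -> List[Dict[str, str]]:
    """Keep one completion per prompt: the one that scores best on the proxy."""
    return [
        {"prompt": prompt, "completion": select_best(candidates, score)}
        for prompt, candidates in samples_per_prompt.items()
    ]


# Hypothetical candidates; in practice these would be sampled from a model.
samples = {
    "Describe the sea.": [
        "The sea is big and the sea is blue and the sea is deep.",
        "Vast, restless water stretches toward a pale horizon.",
    ],
}

# The selected pairs form the fine-tuning set.
dataset = build_dataset(samples, type_token_ratio)
print(dataset[0]["completion"])
```

For an attribute that should be minimized, such as toxicity, the same selection works with a negated score, for example `lambda text: -toxicity(text)`, where `toxicity` is whatever classifier or metric is available.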
Initial results with active inheritance show substantial improvements: models demonstrated a 116% increase in length, a 43% increase in lexical diversity, and up to a 40% reduction in toxicity. These results suggest that active inheritance could improve the safety and quality of language models.
The researchers also examined passive inheritance, in which models inherit properties from synthetic data without any active guidance. Models proved sensitive to the properties of their training data, which raises renewed concerns about unintentional biases and highlights the importance of curating synthetic data.
Overall, the research underscores that synthetic data can strongly influence the attributes of large language models. With active inheritance, Cohere’s researchers offer a framework that guides synthetic data generation towards desired features such as higher lexical diversity and lower toxicity, which can help make models trained on synthetic data safer and more effective. The findings indicate that desired attributes can be instilled in a model’s generations with minimal effort. Active inheritance is therefore a promising technique for optimizing machine learning models and a step towards more sophisticated and reliable AI systems.