AI researchers at Google have developed a new approach to generating synthetic datasets that preserve individuals’ privacy while remaining useful for training predictive models. As machine learning models increasingly rely on large datasets, protecting personal data has become critical. The researchers achieve this through differentially private synthetic data: newly generated datasets that mirror the statistical properties of the original but are entirely artificial, preserving user privacy while still enabling effective model training.
Previous privacy-preserving approaches applied differentially private machine learning (DP-ML) algorithms directly to the data-generation model, which provides robust privacy guarantees. However, this can be computationally demanding for high-dimensional datasets and does not always yield high-quality results. To overcome these challenges, the Google researchers proposed producing differentially private synthetic data using parameter-efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation) and prompt fine-tuning. These techniques modify only a small fraction of the model’s parameters during private training, reducing computational demands and potentially improving the quality of the synthetic data.
The new approach involves training large language models (LLMs) on a large corpus of public data, then fine-tuning them with Differentially Private Stochastic Gradient Descent (DP-SGD) on a sensitive dataset, restricting the fine-tuning to a specific subset of the model’s parameters. In LoRA fine-tuning, each weight matrix W in the model is replaced with W + LR, where L and R are low-rank matrices, and only L and R are trained. Prompt fine-tuning adds a “prompt tensor” at the start of the network and trains only its weights, effectively modifying only the input prompt the LLM consumes.
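The two ingredients above, a LoRA-style low-rank reparameterization and a DP-SGD update (per-example gradient clipping plus Gaussian noise), can be sketched in a few lines of NumPy. The shapes, rank, clipping norm, and noise multiplier below are illustrative assumptions, not values from the research; real per-example gradients would come from backpropagation rather than random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight W (d_out x d_in); only L and R are trained.
d_out, d_in, rank = 16, 32, 4
W = rng.normal(size=(d_out, d_in))
L = np.zeros((d_out, rank))              # zero init, so W + L @ R == W at start
R = rng.normal(size=(rank, d_in)) * 0.01

def effective_weight(W, L, R):
    """LoRA: the layer applies W + L @ R in place of W."""
    return W + L @ R

def dp_sgd_update(per_example_grads, clip_norm=1.0, noise_mult=1.0, lr=0.1):
    """One DP-SGD step on a trainable (LoRA) parameter.

    Each per-example gradient is clipped to clip_norm, the clipped
    gradients are summed, and Gaussian noise scaled by clip_norm is
    added before averaging -- the core of DP-SGD.
    """
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(scale=noise_mult * clip_norm, size=total.shape)
    return -lr * (total + noise) / len(per_example_grads)

# Stand-in per-example gradients for L (a real run would backprop these).
per_example_grads = [rng.normal(size=L.shape) for _ in range(8)]
L = L + dp_sgd_update(per_example_grads)

W_eff = effective_weight(W, L, R)
print(W_eff.shape)  # (16, 32)
```

Note the parameter saving: here L and R together hold 16·4 + 4·32 = 192 trainable values versus 512 for the full matrix, and the gap widens rapidly at LLM scale, which is what makes private fine-tuning cheaper.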
Empirical results revealed that LoRA fine-tuning outperforms both full-parameter fine-tuning and prompt-based tuning: test classifiers trained on synthetic data generated by LoRA-fine-tuned LLMs surpassed those trained on synthetic data from the other fine-tuning methods. In their experiments, the researchers trained a large model (LaMDA-8B) on public data and then privately fine-tuned it on three publicly available datasets standing in for sensitive data. The resulting synthetic data was used to train classifiers for tasks such as sentiment analysis and topic categorization, and the results demonstrated the effectiveness of the proposed method.
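The evaluation protocol described above, train a classifier on the synthetic data, then measure its accuracy on real held-out data, can be illustrated with a toy NumPy sketch. The data generator, feature dimensions, and the plain logistic-regression trainer here are all hypothetical stand-ins for the LLM-generated text and the paper's actual classifiers.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, shift=1.5):
    """Toy two-class data; the 'synthetic' set mirrors the 'real' distribution."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + shift * y[:, None]
    return X, y

# Stand-ins: X_syn plays the role of features from LLM-generated synthetic
# text, X_real a held-out slice of the original (sensitive) dataset.
X_syn, y_syn = make_data(400)
X_real, y_real = make_data(200)

def train_logreg(X, y, lr=0.1, steps=300):
    """Minimal gradient-descent logistic regression."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Train only on synthetic data, evaluate only on real data.
w, b = train_logreg(X_syn, y_syn)
accuracy = np.mean(((X_real @ w + b) > 0) == y_real)
print(f"real-data accuracy of classifier trained on synthetic data: {accuracy:.2f}")
```

The key point of the protocol is that the real data never touches classifier training; high accuracy on it indicates the synthetic data retained the distributional signal the downstream task needs.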
In conclusion, Google’s method of generating differentially private synthetic data with parameter-efficient fine-tuning outperforms existing approaches. By tuning only a small subset of parameters, it cuts computational requirements and improves synthetic-data quality, preserving privacy while maintaining high utility for training predictive models. The empirical results underline the method’s effectiveness and suggest its potential for wider application in privacy-preserving machine learning.