Google AI researchers are working to generate high-quality synthetic datasets while preserving user privacy. As machine learning (ML) increasingly relies on large datasets, safeguarding individuals' data has become essential. Their solution is differentially private synthetic data: new datasets that are entirely artificial yet capture the key characteristics of the original data.
Existing privacy-preserving data generation pipelines rely on differentially private machine learning (DP-ML) algorithms, which offer strong privacy guarantees but can be computationally demanding and do not always yield high-quality results. Earlier work paired large language models (LLMs) with differentially private stochastic gradient descent (DP-SGD) to generate private synthetic data: because the training procedure is differentially private, the resulting synthetic data reveals nothing specific about the individuals in the sensitive dataset.
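The privacy guarantee of DP-SGD comes from two mechanics: clipping each example's gradient to bound any single individual's influence, then adding Gaussian noise calibrated to that bound. As a minimal illustrative sketch (not the authors' implementation; production systems use libraries such as TensorFlow Privacy or Opacus), one DP-SGD step on a batch of per-example gradients might look like:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One DP-SGD step: clip each example's gradient, sum, add noise, average.

    per_example_grads: array of shape (batch, dim), one gradient row per example.
    """
    # Scale each row so its L2 norm is at most clip_norm (bounds per-person influence).
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    # Add Gaussian noise calibrated to the clipping bound, then take the mean update.
    noisy_sum = clipped.sum(axis=0) + np.random.normal(
        0.0, noise_mult * clip_norm, size=params.shape)
    return params - lr * noisy_sum / len(per_example_grads)
```

Because the noise scale is tied to the clipping bound rather than the data, the update satisfies a formal differential-privacy guarantee regardless of what any one example contains.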
Google's researchers propose an improved approach to generating differentially private synthetic data. They use parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA) and prompt fine-tuning, to update only a small number of parameters during private training. Modifying fewer parameters reduces computational overhead and can improve the quality of the synthetic data.
The method begins by training an LLM on a large corpus of public data. The LLM is then fine-tuned with DP-SGD on the sensitive dataset, updating only a subset of the model's parameters. LoRA fine-tuning replaces each weight matrix W in the model with W + LR, where L and R are low-rank matrices, and trains only L and R. Prompt fine-tuning inserts a "prompt tensor" at the start of the network and trains only its weights, effectively modifying only the input prompt seen by the LLM.
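The W + LR reparameterization above can be sketched in a few lines. This is a hedged toy illustration (class name and zero-initialization convention are my own, not from the paper): the pretrained weight W stays frozen, and only the two small factors L and R would receive gradients during DP-SGD.

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update L @ R."""

    def __init__(self, W, rank=4, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                         # frozen
        self.L = rng.normal(0.0, 0.01, size=(d_out, rank)) # trainable
        self.R = np.zeros((rank, d_in))                    # trainable, zero init
        # Zero-initializing R makes L @ R == 0, so fine-tuning starts exactly at W.

    def forward(self, x):
        # Effective weight is W + L @ R; only L and R are updated privately.
        return (self.W + self.L @ self.R) @ x

    def trainable_params(self):
        return self.L.size + self.R.size
```

For a square 1024-dimensional layer with rank 4, this trains 2 * 1024 * 4 = 8,192 parameters instead of the full 1,048,576, which is why the private training step becomes so much cheaper.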
In empirical results, LoRA fine-tuning, which modifies only around 20 million parameters, outperforms both full-parameter fine-tuning and prompt-based tuning. This suggests there is an optimal number of trainable parameters that balances computational efficiency against data quality. Classifiers trained on synthetic data from LoRA fine-tuned LLMs outperformed those trained on synthetic data from the other tuning methods, and in some cases even those trained directly on the original sensitive data with DP-SGD.
The researchers tested the approach on three publicly available datasets, IMDB, Yelp, and AG News, treating each as sensitive. A decoder-only LLM (LaMDA 8B) was pre-trained on public data and then privately fine-tuned. Classifier performance on held-out subsets of the original data demonstrated the efficacy of the proposed method.
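A quick back-of-the-envelope calculation, using the roughly 20 million LoRA parameters and the 8-billion-parameter model size reported here, shows how small the privately trained fraction actually is:

```python
# Figures reported in the article (approximate).
total_params = 8_000_000_000   # 8B-parameter decoder-only LLM
lora_params = 20_000_000       # parameters updated by LoRA fine-tuning

fraction = lora_params / total_params
print(f"LoRA privately trains {fraction:.2%} of the model's parameters")  # 0.25%
```

Only about a quarter of one percent of the model is touched by the expensive per-example clipping and noising of DP-SGD.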
Google's method, which uses parameter-efficient fine-tuning to generate differentially private synthetic data, outperforms previous approaches. Fine-tuning a smaller subset of parameters reduces computational requirements and improves synthetic data quality. The result preserves user privacy while retaining high utility for training predictive models, making the technique a valuable tool for organizations. The empirical results highlight the method's effectiveness and its potential for broader applications in privacy-preserving machine learning.
The research paper and accompanying blog post are available for further reading; all credit for this research goes to the team of researchers behind the project.