Generative AI is increasingly used to create synthetic data, helping organizations cope with situations where real data is limited or sensitive. For the past three years, DataCebo, an MIT spinoff, has offered a generative software system called the Synthetic Data Vault (SDV) that lets organizations create synthetic data for applications such as software testing and training machine learning models. SDV has been downloaded more than 1 million times and used by more than 10,000 data scientists to generate synthetic tabular data.
The open-source SDV tools were first developed in 2016 by a group at the Data to AI Lab, led by Principal Research Scientist Kalyan Veeramachaneni and alumna Neha Patki, to help organizations create synthetic data that mirrors the statistical properties of real data. Synthetic data lets companies run simulations and test new software before its public release without exposing sensitive information.
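For a sense of what this looks like in practice, here is a minimal sketch of SDV's documented single-table workflow: a synthesizer learns a table's distributions and correlations, then samples new rows that mimic them. The column names and values below are invented for illustration, and exact class names can differ between SDV versions.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Illustrative stand-in for a sensitive real table.
real_data = pd.DataFrame({
    "age": [34, 45, 29, 52, 41, 38],
    "annual_income": [62000, 88000, 51000, 94000, 73000, 67000],
    "has_churned": [False, True, False, True, False, True],
})

# Infer column types from the data, then learn its statistical properties.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Sample fresh rows that resemble the real table's distributions
# and correlations without copying any real record.
synthetic_data = synthesizer.sample(num_rows=1000)
print(synthetic_data.head())
```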
In 2020, Veeramachaneni and Patki founded DataCebo to expand SDV's features for larger organizations. DataCebo's use cases range from helping airlines plan for rare weather events with a flight simulator, to synthesizing medical records to predict patient health outcomes, to generating synthetic student data to assess whether admissions policies are equitable and unbiased.
In 2021, the global data science community platform Kaggle held a competition that used SDV to generate synthetic data sets, a strategy adopted to avoid releasing proprietary data. About 30,000 data scientists participated.
DataCebo continues to focus on software testing: generative models built with SDV can produce the specific scenarios and edge cases an application needs to be tested against, saving time while preserving privacy.
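SDV's documented conditional sampling API suggests one way such scenario generation can work: requesting synthetic rows that satisfy fixed column values. A minimal sketch, reusing the synthesizer and the hypothetical "has_churned" column from the earlier example:

```python
from sdv.sampling import Condition

# Request 200 synthetic rows for a targeted scenario: churned users.
churned_users = Condition(
    column_values={"has_churned": True},
    num_rows=200,
)

# The synthesizer generates rows consistent with both the learned
# distributions and the fixed condition, yielding test fixtures
# for a rare case without touching real user records.
edge_cases = synthesizer.sample_from_conditions(conditions=[churned_users])
```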
As large-scale software applications generate ever more tabular data through user activity, DataCebo is leading the charge in developing the field of "synthetic enterprise data": synthetic versions of the data those applications produce. The company continues to refine its algorithms and has introduced new tools to assess the quality of synthetic data and measure model performance. According to Veeramachaneni, synthetic data will transform how organizations work with data; he predicts that 90 percent of enterprise operations could eventually rely on synthetic data.
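One such quality-assessment tool is the quality report shipped with SDV (backed by its companion SDMetrics library). A minimal sketch of its documented usage, continuing the earlier example; the exact module path may vary by version:

```python
from sdv.evaluation.single_table import evaluate_quality

# Compare the synthetic table against the real one. The report scores how
# well per-column distributions ("Column Shapes") and pairwise correlations
# ("Column Pair Trends") are preserved.
quality_report = evaluate_quality(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
)
print(quality_report.get_score())  # overall score between 0 and 1
print(quality_report.get_details(property_name="Column Shapes"))
```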