Generative AI can produce realistic synthetic data that helps organizations in sectors such as healthcare, aviation, and software development run their operations more efficiently. For the last three years, MIT spinout DataCebo has offered the Synthetic Data Vault (SDV), a generative software system that creates synthetic data for testing software applications and training machine learning models. SDV has been downloaded more than a million times and is used by more than 10,000 data scientists. DataCebo’s success stems largely from SDV’s potential to transform software testing, a field of growing importance.
SDV originated in MIT’s Data to AI Lab, which set out to help organizations produce synthetic data whose statistical properties closely match those of real data. Because synthetic data preserves the statistical relationships among data points, companies can use it in place of sensitive data in their programs. It also lets firms run newly developed software through simulations to assess its performance before it is released to the public.
DataCebo was founded in 2020 to build additional SDV features for larger organizations. SDV has since been applied in a wide range of settings. For instance, a new flight simulator from DataCebo lets airlines prepare for rare weather events using SDV-generated synthetic data rather than relying solely on historical data. SDV has also been used to synthesize medical records for predicting health outcomes in patients with cystic fibrosis. In education, it was used to create synthetic student data to evaluate whether certain admission policies were merit-based and unbiased.
DataCebo has stayed close to its MIT roots: all of its employees are MIT alumni. While DataCebo’s open-source tools can be adapted for a variety of uses, the company is focused on expanding its presence in software testing. Traditionally, developers have written scripts by hand to produce synthetic data. A generative model built with SDV can instead learn from a sample of collected data and then generate a large volume of synthetic data that retains the properties of the real data and can be used for testing. For instance, a bank testing a program designed to reject transfers from accounts that lack sufficient funds could use DataCebo’s generative models to simulate many accounts transacting simultaneously.
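The learn-then-generate workflow described above can be illustrated with a small sketch. It does not use SDV or reproduce its models. It assumes a deliberately simple stand-in (a two-column Gaussian fitted in log space) and uses only the Python standard library. The names `fit_log_gaussian`, `sample_rows`, and `should_reject`, and all of the numbers, are hypothetical and chosen for illustration only.

```python
import math
import random
import statistics


def fit_log_gaussian(rows):
    """Fit a two-column Gaussian in log space to (balance, transfer) pairs."""
    logs = [(math.log(b), math.log(t)) for b, t in rows]
    xs = [x for x, _ in logs]
    ys = [y for _, y in logs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in logs) / (len(logs) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_rows(model, n, rng):
    """Draw n synthetic (balance, transfer) pairs that keep the fitted correlation."""
    rho = model["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        log_b = model["mx"] + model["sx"] * z1
        # Mix z1 into the second column so the pair has correlation rho.
        log_t = model["my"] + model["sy"] * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        rows.append((math.exp(log_b), math.exp(log_t)))
    return rows


def should_reject(balance, transfer):
    """Hypothetical program under test: reject transfers the balance cannot cover."""
    return transfer > balance


rng = random.Random(0)

# Made-up stand-in for a small sample of collected data.
real_rows = []
for _ in range(500):
    balance = math.exp(rng.gauss(6.0, 1.0))
    transfer = balance * math.exp(rng.gauss(-1.5, 0.8))
    real_rows.append((balance, transfer))

# Learn from the small sample, then generate a much larger synthetic set.
model = fit_log_gaussian(real_rows)
synthetic_rows = sample_rows(model, 20_000, rng)

# Exercise the program under test on every synthetic transaction.
rejected = sum(should_reject(b, t) for b, t in synthetic_rows)
```

Working in log space keeps every synthetic balance and transfer positive. A production tool would also need to handle many columns, mixed data types, and business constraints, which this sketch leaves out.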
DataCebo also recently released features that extend SDV’s usefulness, including tools that assess the “realism” of the generated data and a mechanism for comparing the performance of different models. These tools are intended to build confidence in synthetic data. DataCebo co-founder Kalyan Veeramachaneni believes that synthetic data produced by generative models will transform all data work, with the potential to support 90 percent of enterprise operations. In this way, DataCebo aims to help enterprises adopt AI and other data science tools more responsibly and transparently.
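To make the idea of a “realism” score concrete, here is a minimal sketch of one possible measure for a single numeric column: one minus the two-sample Kolmogorov-Smirnov statistic, so that 1.0 means the real and synthetic distributions coincide. The choice of this statistic is an assumption made for illustration, and the source does not say which metrics DataCebo’s tools compute. The function names and the two candidate "models" are hypothetical, and the data is made up.

```python
import bisect
import random


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def realism_score(real, synthetic):
    """Higher is better: 1.0 means the column distributions match exactly."""
    return 1.0 - ks_statistic(real, synthetic)


rng = random.Random(1)
real_column = [rng.gauss(50, 10) for _ in range(1000)]

# Two hypothetical generators: one matches the real distribution, one is shifted.
candidates = {
    "model_a": [rng.gauss(50, 10) for _ in range(1000)],
    "model_b": [rng.gauss(70, 10) for _ in range(1000)],
}

# Compare models by scoring each one's output against the real column.
scores = {name: realism_score(real_column, col) for name, col in candidates.items()}
best_model = max(scores, key=scores.get)
```

A score like this only checks one column at a time. Judging whether relationships between columns are preserved would need additional measures, such as comparing correlations.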