Generative AI, best known for creating text and images, also holds extensive potential for producing realistic synthetic data. The ability to generate synthetic data can assist organizations, particularly where real-world data is scarce or sensitive. It can support, for instance, patient care, the rerouting of flights during adverse weather, and the improvement of software platforms.
An MIT startup called DataCebo has developed a generative software system, the Synthetic Data Vault (SDV), that enables organizations to create synthetic data. Among other applications, this data can be used for performance testing of software and for training machine learning models.
Thanks in part to its capacity to transform software testing, SDV has seen considerable success: it has been downloaded more than a million times, and over 10,000 data scientists have used the open-source library to create synthetic tabular data. The system's popularity traces back to 2016, when the Data to AI Lab unveiled a series of open-source generative AI tools. These tools produce synthetic data that matches the statistical properties of real data, so the synthetic data can stand in for real data in programs while preserving the statistical relationships between data points.
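SDV's actual models also capture relationships between columns and handle many data types; the sketch below is only a minimal, hypothetical illustration of the core idea described above: fit summary statistics from real tabular data, then sample new rows with similar statistical properties. The column names, values, and function names here are invented for illustration and are not SDV's API.

```python
import random
import statistics

def fit_columns(rows):
    """Fit a mean and standard deviation for each numeric column."""
    columns = rows[0].keys()
    return {
        col: (statistics.mean(r[col] for r in rows),
              statistics.stdev(r[col] for r in rows))
        for col in columns
    }

def sample_synthetic(params, num_rows, seed=0):
    """Draw synthetic rows whose columns follow the fitted distributions.

    Each column is sampled independently from a normal distribution here;
    real synthesizers also model dependencies between columns.
    """
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in params.items()}
        for _ in range(num_rows)
    ]

# Hypothetical "real" performance-testing data.
real = [
    {"latency_ms": 100.0, "payload_kb": 10.0},
    {"latency_ms": 120.0, "payload_kb": 12.0},
    {"latency_ms": 90.0,  "payload_kb": 9.0},
    {"latency_ms": 110.0, "payload_kb": 11.0},
]

params = fit_columns(real)
# Sample far more rows than we started with, a common need in load testing.
synthetic = sample_synthetic(params, num_rows=1000, seed=42)
```

With enough samples, the synthetic columns' means and spreads closely track the real data's, even though no real row is ever copied, which is what lets the synthetic data stand in for sensitive originals.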
Synthetic data has found a variety of applications: airlines can simulate rare weather events to plan better, medical records can be synthesized to predict health outcomes, and synthetic student data has even been used to evaluate the fairness of admission policies.
DataCebo’s future priorities lie in deepening its impact on software testing. Creating synthetic test data by hand is time-consuming; SDV can instead sample a large volume of synthetic data with the same statistical properties as the real data.
DataCebo has also been instrumental in advancing synthetic enterprise data, a term it uses for data generated from user behavior on large firms’ software applications. “Enterprise data of this kind is complex and there’s no universal availability of it, unlike language data,” explains Kalyan Veeramachaneni, a DataCebo co-founder. He adds, “When folks use our publicly available software and report back what works on a certain pattern, we learn a lot of these unique patterns, and it allows us to improve our algorithms.”
Beyond these advancements, DataCebo has also released tools to assess the “realism” of generated data and to compare the performance of different models.
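DataCebo's actual evaluation tooling is far richer than this; the following is only a hypothetical sketch of the idea behind a single-column "realism" score, comparing summary statistics of real and synthetic values. The function name and scoring formula are invented for illustration.

```python
import statistics

def column_realism(real_vals, synth_vals):
    """Score how closely a synthetic column's mean and spread match the
    real column's, on a 0-1 scale (1.0 = identical statistics)."""
    mu_r, sd_r = statistics.mean(real_vals), statistics.stdev(real_vals)
    mu_s, sd_s = statistics.mean(synth_vals), statistics.stdev(synth_vals)
    # Relative errors in mean and standard deviation, averaged and
    # subtracted from 1 so that a perfect match scores 1.0.
    mean_err = abs(mu_r - mu_s) / (abs(mu_r) or 1.0)
    sd_err = abs(sd_r - sd_s) / (sd_r or 1.0)
    return max(0.0, 1.0 - (mean_err + sd_err) / 2)

# Hypothetical real column and two synthetic candidates.
real = [100.0, 120.0, 90.0, 110.0]
good = [101.0, 118.0, 92.0, 109.0]   # similar scale and spread
bad = [10.0, 12.0, 9.0, 11.0]        # wrong scale entirely

# The closer synthetic candidate earns the higher realism score.
print(column_realism(real, good), column_realism(real, bad))
```

A full evaluation would aggregate scores like this across every column and also check cross-column relationships, which is where real metrics libraries do the heavy lifting.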
DataCebo’s ultimate goal is to encourage transparency and responsibility among companies rushing to adopt AI and other data science tools. The company forecasts that synthetic data generated from AI models will revolutionize all realms of data work, and it believes that 90% of enterprise operations could eventually be powered by synthetic data.