Generative AI is best known for producing text and images, but companies now also use it to create synthetic data for scenarios such as patient treatment, plane rerouting, and software improvement, especially where real-world data is scarce or sensitive. DataCebo, an MIT spinoff, has built a generative software system called the Synthetic Data Vault (SDV) to help organizations generate synthetic data for testing software applications and training machine learning models.
Over the last three years, more than 10,000 data scientists have downloaded SDV, and it has been used to generate synthetic tabular data more than a million times. A flight simulator developed by DataCebo lets airlines plan for rare weather events using synthetic data rather than relying solely on historical data. DataCebo’s realistic synthetic medical records have been used to predict health outcomes for patients with cystic fibrosis. A team from Norway used SDV to create synthetic student data and assess how meritocratic and how biased various admissions policies are.
In a 2021 competition hosted by the data science platform Kaggle, around 30,000 data scientists used SDV to create synthetic datasets and so avoided using proprietary data. DataCebo aims to transform software testing. Developers often write scripts by hand to create synthetic data for testing applications, whereas SDV lets them generate large volumes of synthetic data that share the same properties as the real data. This makes testing more effective because developers can create edge cases and specific scenarios. Synthetic data is also preferred because it sidesteps the difficulties of handling sensitive data and the regulations that govern it.
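To make the idea of synthetic data "sharing the same properties as real data" concrete, here is a minimal Python sketch using only the standard library. It is not SDV's algorithm or API: the function names (fit_table, sample_table) and the tiny example table are hypothetical, and the sketch models each column independently, which a real generator would not be limited to.

```python
import random
import statistics
from collections import Counter


def fit_column(values):
    """Summarize one column: mean/stdev for numbers, label counts otherwise."""
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "kind": "numeric",
            "mean": statistics.mean(values),
            "stdev": statistics.pstdev(values),
        }
    counts = Counter(values)
    return {
        "kind": "categorical",
        "labels": list(counts),
        "weights": list(counts.values()),
    }


def fit_table(rows):
    """Learn a per-column summary from a list of dict rows (hypothetical helper)."""
    return {name: fit_column([row[name] for row in rows]) for name in rows[0]}


def sample_table(model, num_rows, seed=0):
    """Draw synthetic rows from the per-column summaries."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(num_rows):
        row = {}
        for name, spec in model.items():
            if spec["kind"] == "numeric":
                # Assumption: numeric columns are roughly Gaussian.
                row[name] = rng.gauss(spec["mean"], spec["stdev"])
            else:
                # Sample labels in proportion to their observed frequency.
                row[name] = rng.choices(spec["labels"], weights=spec["weights"])[0]
        synthetic.append(row)
    return synthetic


# Made-up "real" data, for illustration only.
real_rows = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
    {"age": 27, "plan": "basic"},
    {"age": 45, "plan": "basic"},
    {"age": 39, "plan": "premium"},
]

model = fit_table(real_rows)
synthetic_rows = sample_table(model, num_rows=1000, seed=42)
```

Because the sketch treats columns independently, it preserves each column's distribution but not relationships between columns (for example, whether older customers favor one plan). Capturing those relationships is the harder part that dedicated generative models address.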
DataCebo is advancing the creation of synthetic enterprise data, meaning data generated from user behavior on large-scale software applications. The company has released features that make SDV more useful, including tools to gauge the “realism” of the generated data and to compare the performance of different models. It seeks to help companies adopt AI and other data science tools in a more transparent and responsible manner, and it maintains that synthetic data from generative models will transform data work, with 90 percent of enterprise processes likely to be conducted with synthetic data.
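One simple way to put a number on "realism" is to compare how often each category appears in a real column and in its synthetic counterpart. The Python sketch below scores that agreement as one minus the total variation distance between the two frequency distributions. It is an illustrative assumption, not a description of the metrics DataCebo ships, and the function name category_similarity is hypothetical.

```python
from collections import Counter


def category_similarity(real_values, synthetic_values):
    """Return a score in [0, 1]; 1.0 means identical category frequencies.

    Computed as 1 minus the total variation distance between the two
    empirical distributions (hypothetical helper, not an SDV function).
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    labels = set(real_counts) | set(synth_counts)
    distance = 0.5 * sum(
        abs(real_counts[label] / len(real_values)
            - synth_counts[label] / len(synthetic_values))
        for label in labels
    )
    return 1.0 - distance


# Made-up columns, for illustration only.
real_plans = ["basic", "basic", "premium", "premium"]
synthetic_plans = ["basic", "basic", "basic", "premium"]
score = category_similarity(real_plans, synthetic_plans)
```

A score like this covers only one categorical column at a time. Judging a whole table would also require checks on numeric columns and on relationships between columns.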