Generative AI can create synthetic data that mimics real-world scenarios, which can help organizations improve their operations. DataCebo, an MIT spinout, has developed a generative software system called the Synthetic Data Vault (SDV), which thousands of data scientists have used to generate synthetic tabular data. The system is widely used for tasks such as testing software applications and training machine learning models.
SDV has been downloaded more than a million times since its launch. Its founders attribute that success to its potential to change how software is tested. SDV began as an open-source generative AI tool that creates synthetic data matching the statistical properties of real data. Companies can therefore use synthetic data in place of sensitive data while still preserving the statistical relationships between data points. They can also run new software through simulations to assess how it performs before releasing it to the public.
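To make "matching the statistical properties of real data" concrete, here is a minimal sketch in plain Python. It is not SDV's algorithm or API. It assumes a toy table with two numeric columns (hypothetical "age" and "income" values), fits a simple two-variable Gaussian model to it, and samples new rows from that model. All function names and the data are illustrative assumptions.

```python
import math
import random
import statistics


def fit_gaussian(rows):
    """Learn the means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample(model, n, seed=0):
    """Draw n synthetic rows from the fitted two-variable Gaussian."""
    rng = random.Random(seed)
    rho = model["rho"]
    # Clamp guards against tiny negative values caused by rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Hypothetical "real" table: income rises with age, plus noise.
rng = random.Random(42)
real = []
for _ in range(2000):
    age = rng.uniform(20, 65)
    real.append((age, 1000 * age + rng.gauss(0, 8000)))

model = fit_gaussian(real)
synthetic = sample(model, 2000, seed=1)
synthetic_model = fit_gaussian(synthetic)

print("real correlation:     ", round(model["rho"], 3))
print("synthetic correlation:", round(synthetic_model["rho"], 3))
```

The synthetic rows are freshly sampled rather than copied from the original table, yet the relationship between the two columns carries over. Production tools handle many columns, mixed data types, and non-Gaussian shapes, which this sketch does not attempt.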
In 2020, the researchers behind SDV founded DataCebo to build out SDV’s features for larger organizations. Its uses have been varied. The company’s flight simulator prototype, for example, lets airlines prepare for rare weather events using predictive models. In healthcare, synthesized medical records were used to predict health outcomes for patients with cystic fibrosis.
DataCebo remains focused on growing SDV’s adoption in software testing. The company stresses that data is central to testing software, and that generative models built with SDV can help teams develop applications more efficiently.
DataCebo co-founder Neha Patki also advocates synthetic data because data in some industries is sensitive and subject to stringent privacy rules. From a privacy perspective, synthetic data is the better alternative.
DataCebo continues to develop tools for synthetic enterprise data. The company recently added features to SDV for evaluating the “realism” of generated data and for comparing the performance of different models. These features are intended to build organizations’ trust in this new form of data.
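One generic way to put a number on "realism" for a single numeric column is to compare the distribution of real values with the distribution of synthetic values. The sketch below is an assumption for illustration, not SDV's evaluation code. It computes one minus the two-sample Kolmogorov-Smirnov statistic, so a score of 1.0 means the two empirical distributions are identical and lower scores mean they diverge. The function name and the sample data are hypothetical.

```python
import bisect
import random


def ks_complement(real, synthetic):
    """Return 1 minus the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap between two step-function CDFs can only change at observed values.
    for v in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, v) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, v) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


# Hypothetical column values: one faithful synthetic sample, one shifted sample.
rng = random.Random(0)
real_col = [rng.gauss(50, 10) for _ in range(1000)]
good_synth = [rng.gauss(50, 10) for _ in range(1000)]
bad_synth = [rng.gauss(70, 10) for _ in range(1000)]

good_score = ks_complement(real_col, good_synth)
bad_score = ks_complement(real_col, bad_synth)
print("faithful model score:", round(good_score, 3))
print("shifted model score: ", round(bad_score, 3))
```

Scoring the output of two candidate models against the same real column gives a simple basis for comparing them. A per-column score like this says nothing about relationships between columns, so it is only one piece of a fuller evaluation.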
As adoption of AI and data science tools accelerates across sectors, DataCebo aims to help organizations make that transition transparently and responsibly. The company believes that synthetic data produced by generative models will transform data work across enterprises in the coming years.