Generative AI, which can create text and images, is becoming an essential tool in today’s data-driven society. It is now also being used to produce realistic synthetic data, which can help in situations where real data is limited or sensitive. For the past three years, DataCebo, an MIT spinoff, has offered a software system called the Synthetic Data Vault (SDV) to help organizations generate synthetic data for applications such as software testing and machine learning model training.
The SDV has been downloaded over a million times and is used by more than 10,000 data scientists. Its founders, Kalyan Veeramachaneni and Neha Patki, credit that success to the way SDV is transforming software testing. In 2016, Veeramachaneni’s team released an open-source suite of tools for creating synthetic data that preserves the statistical relationships of real data while containing no sensitive information, which allows software to be tested through simulation.
DataCebo was founded in 2020 to extend SDV with features for larger organizations. Its uses have since been wide-reaching, from airlines planning for rare weather events to predicting health outcomes for cystic fibrosis patients based on synthesized medical records. In 2021, the data science platform Kaggle hosted a competition in which around 30,000 data scientists built solutions and predicted outcomes on synthetic data sets created with SDV, which avoided the use of proprietary data.
Software testing remains a significant focus. Developers can use generative models built with SDV to learn from a sample of collected data, then generate a large volume of synthetic data that simulates specific scenarios for application tests. This approach lets developers test edge cases efficiently: a bank, for example, can simulate accounts with no money in them to check that transfers are rejected.
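The bank scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library rather than SDV itself, the "model" is just a mean and standard deviation fitted to a made-up sample of balances, and every name in it (fit_balance_model, sample_accounts, transfer_allowed) is hypothetical.

```python
import random
import statistics

def fit_balance_model(real_balances):
    """Learn a very simple model (mean and standard deviation) from sample data."""
    return statistics.mean(real_balances), statistics.stdev(real_balances)

def sample_accounts(model, n, seed=0, force_balance=None):
    """Draw n synthetic accounts.

    When force_balance is given, every account gets that balance, which is how
    this sketch produces a specific edge-case scenario.
    """
    mean, stdev = model
    rng = random.Random(seed)
    accounts = []
    for i in range(n):
        if force_balance is not None:
            balance = force_balance
        else:
            # Clamp at zero so sampled balances are never negative.
            balance = max(0.0, round(rng.gauss(mean, stdev), 2))
        accounts.append({"account_id": f"SYN-{i:05d}", "balance": balance})
    return accounts

def transfer_allowed(account, amount):
    """Application rule under test: reject transfers the balance cannot cover."""
    return amount > 0 and account["balance"] >= amount

# Made-up sample standing in for collected real data.
real_balances = [120.50, 980.00, 15.25, 4300.10, 660.75]
model = fit_balance_model(real_balances)

typical = sample_accounts(model, n=1000)                   # bulk synthetic data
empty = sample_accounts(model, n=1000, force_balance=0.0)  # edge case: no money

rejected = sum(not transfer_allowed(acct, 25.00) for acct in empty)
print(f"{rejected} of {len(empty)} transfers from empty accounts were rejected")
```

A real generative model would learn far richer structure than a single column’s mean and spread, but the workflow has the same shape: fit on a sample, generate in volume, then pin the conditions that define the scenario under test.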
Data protection concerns have also brought synthetic data into the spotlight, since it is generally regarded as the better option from a privacy perspective. Companies can restrict access to real data, which serves both regulatory requirements and their own best interests. The company views SDV as a crucial tool for advancing what it calls synthetic enterprise data, meaning data generated from user behavior on large-scale software applications. Veeramachaneni notes that, unlike language data, enterprise data is complex and not universally available.
Recently, DataCebo introduced new features to improve SDV’s utility. These include the SDMetrics library, which measures how realistic the generated data is, and SDGym, which helps compare model performance. Veeramachaneni emphasizes that the goal is for organizations to be able to trust this new data: programmable synthetic data lets enterprises build their own insight and intuition into the models, making them more transparent.
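To give a sense of what measuring realism can mean in practice, one common statistic for a single numeric column is the complement of the two-sample Kolmogorov-Smirnov statistic: 1.0 when the real and synthetic distributions coincide, falling toward 0.0 as they diverge. The sketch below is a minimal standard-library version for illustration only. It is not the SDMetrics implementation or its API, and the function name and sample data are assumptions.

```python
import bisect
import random

def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    The KS statistic is the largest gap between the two empirical CDFs,
    so a score of 1.0 means the empirical distributions are identical.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect.bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap

rng = random.Random(0)
real = [rng.gauss(50, 10) for _ in range(500)]          # stand-in for a real column
good_synth = [rng.gauss(50, 10) for _ in range(500)]    # same distribution
poor_synth = [rng.gauss(80, 10) for _ in range(500)]    # shifted distribution

print("well-matched synthetic column:", round(ks_complement(real, good_synth), 3))
print("poorly matched synthetic column:", round(ks_complement(real, poor_synth), 3))
```

A single-column score like this says nothing about relationships between columns, so a fuller evaluation would combine several such measures.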
According to Veeramachaneni, synthetic data produced by generative models will transform data work over the next few years as a growing number of companies adopt AI and data science tools. He believes that as much as 90% of enterprise operations could be performed with synthetic data. In that context, DataCebo sees its role as promoting transparent and responsible adoption of the technology.