
Generative AI, renowned for its capability to autonomously produce text and images, plays a crucial role in creating realistic synthetic data from diverse scenarios, helping organizations optimize operations. A notable initiative in the field is the Synthetic Data Vault (SDV), developed by DataCebo, an MIT spinoff. This generative system aids organizations in creating synthetic data for purposes such as software testing and machine learning model training.

The SDV has been downloaded more than one million times and used by over 10,000 data scientists to generate synthetic tabular data. Its founders, Kalyan Veeramachaneni and Neha Patki, attribute the company's success to SDV, which they believe can transform software testing.

In 2016, under Veeramachaneni’s leadership, the Data to AI Lab introduced a suite of open-source generative AI tools that helped organizations produce synthetic data that mirrored the statistical properties of real data. This significant move gave companies the possibility of using synthetic data that protected sensitive information while keeping statistical relationships between data points intact. It also facilitated running new software through simulations to gauge performance before public releases.
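The core idea of mirroring a dataset's statistical properties can be illustrated with a toy sketch. The code below is not SDV's actual method, only a minimal stand-in: it fits an independent Gaussian to each numeric column of some hypothetical data and samples new rows from those fits. Real tools like SDV model far more, including correlations between columns and categorical fields.

```python
import random
import statistics

def fit_gaussian_columns(rows):
    """Fit an independent Gaussian (mean, stdev) to each numeric column.

    A toy stand-in for what SDV-style generators do far more carefully;
    this version ignores correlations between columns entirely.
    """
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from the fitted per-column Gaussians."""
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params)
            for _ in range(n)]

# Hypothetical "real" data with two numeric columns (age, income).
real = [(34, 52000.0), (41, 61000.0), (29, 48000.0),
        (52, 75000.0), (38, 57000.0)]

params = fit_gaussian_columns(real)
synthetic = sample_synthetic(params, 1000)
```

The synthetic rows contain none of the original records, yet their column means and spreads track those of the real data, which is the basic privacy-versus-utility trade being described.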

In 2020, the founders established DataCebo to further SDV’s functionality for larger firms. Notably, SDV has shown its mettle in multiple sectors. For instance, airlines now use it as a tool to prepare for extraordinary weather events. It has also been used to simulate medical records to forecast health outcomes for patients with cystic fibrosis and to develop unbiased and meritocratic admission processes in educational institutions.

DataCebo's popularity led to a 2021 data science competition on Kaggle in which around 30,000 data scientists used SDV to create synthetic datasets. The company, true to its MIT roots, primarily consists of MIT alumni.

Despite the wide-ranging applications, DataCebo's chief emphasis lies in accelerating software testing. Veeramachaneni says that synthetic data generated by tools like SDV radically enhances the efficiency and effectiveness of software testing. Financial institutions, for instance, can use it to create the 'edge cases' needed to test a system's response to unusual circumstances, such as an attempt to transfer money from an empty account.
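The edge-case idea can be sketched concretely. The example below is entirely hypothetical (the `transfer` function and its records are invented for illustration, not taken from any real banking system or from SDV): synthetic edge-case records, including the empty-account scenario the article mentions, are replayed against a routine under test.

```python
def transfer(balances, src, dst, amount):
    """Hypothetical transfer routine under test.

    Returns False when the transfer is invalid (non-positive amount
    or insufficient funds), True when it succeeds.
    """
    if amount <= 0 or balances.get(src, 0) < amount:
        return False
    balances[src] -= amount
    balances[dst] = balances.get(dst, 0) + amount
    return True

# Synthetic edge-case records: (balances, src, dst, amount, expected result).
edge_cases = [
    ({"A": 0, "B": 10}, "A", "B", 5, False),     # empty source account
    ({"A": 100, "B": 0}, "A", "B", 150, False),  # overdraw attempt
    ({"A": 50, "B": 0}, "A", "B", 0, False),     # zero-amount transfer
    ({"A": 50, "B": 0}, "A", "B", 50, True),     # exact-balance transfer
]

results = [transfer(b, s, d, amt) == expected
           for b, s, d, amt, expected in edge_cases]
```

The point is that rare or risky scenarios, which may be absent from production data, can be manufactured deliberately and exercised before release.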

Synthetic data's benefits extend beyond operational efficiency to include strong privacy protections, a crucial feature in heavily regulated industries that handle sensitive data.

According to Veeramachaneni, DataCebo is pushing the boundaries of what he dubs 'synthetic enterprise data': the data generated by users interacting with large-scale software applications. Such data is unique and complex, and understanding and generating it can significantly improve AI algorithms.

To bolster the utility of synthetic data, DataCebo recently introduced the SDMetrics library, which gauges how realistic generated data is, and SDGym, which compares the performance of different generative models.
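One common way to gauge realism is to compare the distribution of each synthetic column against its real counterpart. The sketch below is not the SDMetrics API; it is a minimal stdlib illustration of the idea, using the two-sample Kolmogorov-Smirnov statistic and reporting its complement so that 1.0 means the distributions match closely.

```python
import bisect

def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    The largest gap between the two empirical CDFs, in [0, 1].
    """
    r, s = sorted(real), sorted(synthetic)

    def cdf(sorted_vals, x):
        # Fraction of values <= x in the sorted sample.
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    xs = sorted(set(real) | set(synthetic))
    return max(abs(cdf(r, x) - cdf(s, x)) for x in xs)

def realism_score(real, synthetic):
    """Score in [0, 1]; 1.0 means the two distributions align closely."""
    return 1.0 - ks_statistic(real, synthetic)
```

For identical samples the score is 1.0; for completely disjoint samples it drops to 0.0. Per-column scores like this can then be averaged into a single report-card number for a synthetic table.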

Veeramachaneni sees the future of data work being revolutionized by synthetic data from generative models, with roughly 90 percent of enterprise operations potentially adopting the nascent technology. DataCebo's ongoing push to make data science practices more transparent and accountable is integral to this shift, underscoring the growing role of AI and data science tools across industries.
