GenSQL, a new AI tool developed by scientists at MIT, is designed to simplify complex statistical analysis of tabular data, enabling users to readily understand and interpret their databases. Users don’t need to grasp what is happening behind the scenes to draw accurate insights.
The system’s capabilities include making predictions, detecting anomalies, estimating missing values, fixing errors, and generating synthetic data. For example, GenSQL could flag a low blood pressure reading for a patient who consistently has high blood pressure: even though the reading may fall within the normal range, it is unusual for that individual.
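The blood pressure example can be illustrated with a small sketch. This is not GenSQL's method or syntax; it assumes a deliberately simple per-patient Gaussian model, and the function name `is_unusual`, the threshold of 3 standard deviations, and the readings are all hypothetical.

```python
from statistics import mean, stdev

def is_unusual(history, reading, z_threshold=3.0):
    """Flag a reading that is far from this patient's own history.

    Assumption: the patient's readings are roughly Gaussian, so distance
    from the personal mean in standard deviations is a fair score.
    """
    mu = mean(history)
    sigma = stdev(history)
    z_score = abs(reading - mu) / sigma
    return z_score > z_threshold

# Hypothetical systolic readings (mmHg) for a patient who usually runs high.
patient_history = [158, 162, 155, 160, 165, 159, 161]

# 118 mmHg sits in a typical population range, yet is far from this patient's norm.
print(is_unusual(patient_history, 118))
# 161 mmHg is high in general, but ordinary for this patient.
print(is_unusual(patient_history, 161))
```

The point of the sketch is that "unusual" is judged against a model of the individual rather than against a fixed population range.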
GenSQL combines a dataset with a generative probabilistic AI model. The advantage is that the system can account for uncertainty and adjust its decisions as new data arrives. The tool can be useful where data sharing is restricted (as with sensitive patient health records) or where real data is sparse, because it can generate and analyze synthetic data that resembles the real data in a database.
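To make the idea of synthetic data concrete, here is a minimal sketch of fitting a generative model to a table and sampling new rows from it. It is an illustration only, not how GenSQL builds its models: it assumes a two-column table, a Gaussian first column, and a linear relationship with Gaussian noise. The helpers `fit_model` and `sample_synthetic` and the (age, blood pressure) rows are hypothetical.

```python
import random
from statistics import mean, stdev

def fit_model(rows):
    """Fit x ~ Normal and y = a + b*x + Normal noise to (x, y) rows."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in rows) / sxx  # least-squares slope
    a = my - b * mx                                      # least-squares intercept
    residuals = [y - (a + b * x) for x, y in rows]
    return {"mx": mx, "sx": stdev(xs), "a": a, "b": b, "noise": stdev(residuals)}

def sample_synthetic(model, n, seed=0):
    """Draw n synthetic (x, y) rows from the fitted model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(model["mx"], model["sx"])
        y = rng.gauss(model["a"] + model["b"] * x, model["noise"])
        rows.append((x, y))
    return rows

# Hypothetical real rows: (age in years, systolic blood pressure in mmHg).
real_rows = [(30, 118), (40, 124), (50, 131), (60, 135), (70, 142)]

model = fit_model(real_rows)
synthetic_rows = sample_synthetic(model, 500)

# The synthetic rows are new values, but they follow the same age/pressure trend.
print(round(model["b"], 2), round(fit_model(synthetic_rows)["b"], 2))
```

None of the synthetic rows is a copy of a real record, yet refitting the model on them recovers nearly the same relationship, which is what makes such data useful when the real table cannot be shared.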
The research team built GenSQL on top of SQL, a widely used programming language for database creation and manipulation that was introduced in the late 1970s. An important feature of the new tool is that its probabilistic models are explainable, so users can read and edit them.
Comparisons with existing AI-based data analysis approaches showed GenSQL to be faster and more accurate. The research team created the tool because SQL offered no efficient way to incorporate probabilistic AI models. Conversely, approaches that use probabilistic models to make inferences did not support complex database queries.
With GenSQL, a user can combine their data with a probabilistic model and run queries against both, with the model consulted behind the scenes. As a result, users can pose more complex queries and also get more accurate answers.
GenSQL was evaluated against traditional methods that use neural networks. The researchers found it to be between 1.7 and 6.8 times faster while delivering more accurate results. In two tests of real-world use, GenSQL successfully identified mislabeled clinical trial data and generated accurate synthetic data that captured intricate relationships in genomics.
These results have prompted the researchers to pursue wider applications, including large-scale models of entire human populations. They also want to make the tool easier to use and add new features, including natural language queries. Their ultimate goal is to develop an AI expert akin to ChatGPT that users can consult about any database and that grounds its answers in GenSQL queries.