A team of researchers from MIT, Digital Garage, and Carnegie Mellon has developed GenSQL, a new probabilistic programming system that allows for querying generative models of database tables. The system extends SQL with additional functions to enable more complex Bayesian workflows, integrating both automatically learned and custom-designed probabilistic models with tabular data.
Probabilistic databases build probabilities into relational systems and rely on algorithms suited to inference over discrete distributions, which lets them handle tasks such as imputing missing values and generating random data. Compared with BayesDB, another probabilistic database system, GenSQL introduces new semantic concepts, soundness guarantees, better performance, and greater expressiveness. It also supports nested queries and lets a query combine results from multiple models.
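To make imputation and random data generation concrete, here is a minimal Python sketch that fits an empirical joint distribution to a tiny table, fills in a missing value, and samples synthetic rows. It does not use GenSQL or its syntax; the table, the column names, and the helper functions `fit_joint`, `impute_plan`, and `generate_rows` are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy table; column names and values are illustrative only.
ROWS = [
    {"region": "north", "plan": "basic"},
    {"region": "north", "plan": "basic"},
    {"region": "north", "plan": "premium"},
    {"region": "south", "plan": "premium"},
    {"region": "south", "plan": "premium"},
    {"region": "south", "plan": None},  # missing value to impute
]


def fit_joint(rows):
    """Estimate an empirical joint distribution from the complete rows."""
    complete = [(r["region"], r["plan"]) for r in rows if None not in r.values()]
    counts = Counter(complete)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def impute_plan(joint, region):
    """Return the most probable plan given the observed region."""
    candidates = {plan: p for (reg, plan), p in joint.items() if reg == region}
    if not candidates:
        raise ValueError(f"no data for region {region!r}")
    return max(candidates, key=candidates.get)


def generate_rows(joint, n, seed=0):
    """Sample n synthetic rows from the fitted joint distribution."""
    rng = random.Random(seed)
    cells = list(joint)
    weights = [joint[c] for c in cells]
    sampled = rng.choices(cells, weights=weights, k=n)
    return [{"region": reg, "plan": plan} for reg, plan in sampled]


joint = fit_joint(ROWS)
```

Calling `impute_plan(joint, "south")` fills the missing cell with the plan most often seen in that region, and `generate_rows(joint, 5)` produces five synthetic rows. A system such as GenSQL works with far richer learned models, but the two tasks have the same shape.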
GenSQL supports both traditional SQL operations and queries against probabilistic models. Its type system ensures that expressions are well defined: it handles continuous and discrete types and includes specific rules for events with zero probability. The semantics are compositional and rest on measure theory for the probabilistic parts of the language. The system is designed in particular for generating synthetic data, querying probabilistic models, and handling complicated conditional queries.
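The need for special rules around zero-probability events shows up even in the simplest discrete case, because the conditional probability P(A | B) = P(A and B) / P(B) is undefined when P(B) = 0. The source does not spell out GenSQL's actual rules, so the Python sketch below uses one common convention (returning `None` for an undefined query) purely as an assumption; the joint table and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint distribution over (weather, arrival) outcomes.
JOINT = {
    ("rain", "late"): Fraction(1, 4),
    ("rain", "on_time"): Fraction(1, 4),
    ("dry", "late"): Fraction(1, 8),
    ("dry", "on_time"): Fraction(3, 8),
}


def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    return sum(p for outcome, p in JOINT.items() if event(outcome))


def conditional(event, given):
    """P(event | given), or None when the conditioning event has probability zero.

    Returning None is an assumed convention, not GenSQL's documented behavior.
    """
    p_given = prob(given)
    if p_given == 0:
        return None
    return prob(lambda o: event(o) and given(o)) / p_given
```

For example, `conditional(lambda o: o[1] == "late", lambda o: o[0] == "rain")` divides 1/4 by 1/2, while conditioning on an outcome that never occurs, such as `o[0] == "snow"`, yields `None` instead of a division error. Continuous columns make the issue unavoidable, since any exact value has probability zero, which is one reason a measure-theoretic treatment is useful.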
GenSQL is implemented in Clojure, and the researchers benchmarked it against similar systems. Using models generated through ClojureCat, they found that GenSQL ran 1.7 to 6.8 times faster than BayesDB across ten benchmark queries. The gains came from the system’s efficient ClojureCat backend and from optimizations such as caching and exploiting independence between columns.
In summary, GenSQL focuses probabilistic programming on applications with tabular data, which sets it apart from general-purpose probabilistic programming languages (PPLs). Its abstract model interface (AMI) lets models written in different languages be plugged into the same workflow. By combining probabilistic models with database operations, GenSQL also makes querying easier. As in a traditional DBMS, optimizations live in the system and are reused across queries, so users gain efficiency in varied domains without writing domain-specific optimizations. Potential applications include creating synthetic data and developing modular queries, both of which support efficient and scalable use of generative models in data analysis.