IBM researchers have taken a major step toward simplifying how insights are extracted from large business databases. These databases are typically queried with Structured Query Language (SQL), the dominant language for database interactions. SQL proficiency, however, is usually confined to a small group of data professionals, which limits broader access to data and its interpretation.
The difficulty lies in writing SQL code that retrieves data across multiple schemas or tables. This complexity limits how well businesses can use their data for strategic decisions. IBM’s answer is ExSL+granite-20b-code, a Granite code model that simplifies data analysis by generating SQL queries from natural-language questions or commands.
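To make the difficulty concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables (customers, orders), their columns, and the question are hypothetical, invented purely for illustration, and do not come from IBM's work. Even a simple business question already needs a join and an aggregation.

```python
import sqlite3

# Hypothetical two-table schema, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0);
""")

# The question a business user would like to ask in plain language.
question = "What is the total order amount for each region?"

# The SQL someone would otherwise have to write by hand: the region lives in
# one table and the amounts in another, so the query must join and aggregate.
sql = """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

A text-to-SQL system aims to produce a query like the one above directly from the plain-language question.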
The researchers built an extractive schema-linking method into the ExSL+granite-20b-code model so that it can work out how a database is organized and locate the tables and columns it needs. The model was tuned into three versions, each targeting one task: identifying related data columns, forming connections between data values, and generating precise SQL code.
IBM’s three-step method for text-to-SQL generation consists of schema linking, content linking, and SQL code generation. Schema linking matches keywords in the natural-language question to the relevant data tables and columns; using an extractive approach speeds this step up considerably. In content linking, sub-tables are converted into string representations and passed to a second model built to generate SQL code fragments, a step that compares column contents with the values mentioned in the query. In the final stage, a third instance of the Granite model evaluates the candidate SQL queries and selects the best by interpreting their execution results.
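The three stages can be pictured with a toy sketch. This is not IBM's implementation and involves no language model: simple keyword matching stands in for the extractive schema linker, a substring lookup stands in for content linking, and "the first candidate that executes and returns rows" stands in for the Granite-based selector. All function names, tables, and candidate queries below are hypothetical.

```python
import re
import sqlite3

def link_schema(question, schema):
    """Toy schema linking: keep tables and columns named in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    linked = {}
    for table, columns in schema.items():
        hits = [c for c in columns if c.lower() in words]
        if hits or table.lower() in words:
            linked[table] = hits
    return linked

def link_content(question, conn, linked):
    """Toy content linking: find stored text values mentioned in the question."""
    q = question.lower()
    matches = []
    for table in linked:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        for col in cols:
            for (value,) in conn.execute(f"SELECT DISTINCT {col} FROM {table}"):
                if isinstance(value, str) and value.lower() in q:
                    matches.append((table, col, value))
    return matches

def pick_candidate(conn, candidates):
    """Toy selection: first candidate that executes and returns rows."""
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # discard queries that fail to run
        if rows:
            return sql, rows
    return None, []

# Hypothetical database and question, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 2, 70.0);
""")
schema = {"customers": ["id", "name", "region"],
          "orders": ["id", "customer_id", "amount"]}
question = "What is the total amount of orders in the EMEA region?"

linked = link_schema(question, schema)
values = link_content(question, conn, linked)

# Hand-written stand-ins for model-generated candidate queries.
candidates = [
    "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'",  # wrong table for region
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN customers c ON c.id = o.customer_id WHERE c.region = 'EMEA'",
]
best_sql, rows = pick_candidate(conn, candidates)
print(linked, values, best_sql, rows, sep="\n")
```

In this sketch the first candidate is rejected because it fails at execution time, which mirrors, in a very simplified way, the idea of using execution results to choose among generated queries.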
On the BIRD benchmark, IBM’s solution for automating SQL generation stood out for its accuracy and speed. It scored 80 for speed, close to the 90 scored by human engineers and well ahead of other AI systems, which averaged 65. This result was attributed largely to the extractive method used for schema linking and the generative approach applied during content linking.
Although the model answers a smaller share of SQL queries correctly than humans do, its performance marks a meaningful advance in automating SQL code writing. The model does more than make data querying more efficient: it widens businesses’ access to valuable data, which can be critical to decision-making. Despite its limitations, IBM’s text-to-SQL generator is a promising option for firms that want to make their operations more data-driven.