IBM researchers are tackling a problem many businesses face: extracting useful insights from large databases. The volume and variety of data can be overwhelming, making it hard for employees to find the information they need. Writing the SQL queries needed to retrieve data across multiple programs and tables is intricate work, and that difficulty keeps businesses from making full use of their data when they make tactical decisions.
Today, SQL is the most common language for interacting with databases, but within most organizations SQL knowledge is limited to a small group of data professionals, which restricts general access to data insights. To counter this problem, IBM researchers proposed a new model named ExSL+granite-20b-code. It simplifies data analysis by using generative AI to turn natural language questions into SQL queries. The model has shown top performance on the BIRD benchmark, which measures how effectively AI models convert natural language into SQL.
The ExSL+granite-20b-code model uses an extractive schema-linking technique to understand how a database is organized and to find the relevant tables and columns. The team developed three versions of the Granite 20B model, one for each part of the process: identifying the relevant columns, creating links to the relevant data values, and generating accurate SQL code.
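To make the idea of schema linking concrete, here is a minimal sketch that matches words in a question against table and column names. It is a toy keyword matcher, not IBM's extractive model: the `SCHEMA` dictionary, the `link_schema` function, and the naive plural handling are all assumptions introduced for illustration.

```python
import re

# Hypothetical toy schema (not from the article): table name -> column names.
SCHEMA = {
    "customers": ["id", "name", "city"],
    "orders": ["id", "customer_id", "total", "order_date"],
    "products": ["id", "title", "price"],
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_schema(question, schema):
    """Return {table: [columns]} for schema items whose name parts occur in the question."""
    words = tokenize(question)
    # Naive singular/plural handling, e.g. "customer" also matches "customers".
    words |= {w + "s" for w in words} | {w[:-1] for w in words if w.endswith("s")}
    linked = {}
    for table, columns in schema.items():
        # A column is a hit if any part of its name (split on "_") is in the question.
        hits = [c for c in columns if tokenize(c.replace("_", " ")) & words]
        if table in words or hits:
            linked[table] = hits
    return linked


linked = link_schema("What is the total of all orders placed by each customer?", SCHEMA)
```

For this question the sketch keeps `customers` and `orders` and drops `products`. Because it matches on word overlap alone, it also picks up `order_date`, which is the kind of false positive a trained linking model is meant to avoid.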
IBM applied a three-step process to improve text-to-SQL generation: schema linking, content linking, and SQL code generation. In the schema linking step, keywords in the question are matched with the relevant tables and columns, and an extractive method speeds this step up considerably. In the content linking step, another model instance, trained to generate multiple pieces of SQL code, converts sub-tables into string representations and analyzes them. Finally, a third instance of the Granite model generates candidate SQL queries and chooses the best one based on their execution results.
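The last step, choosing among generated queries by their execution results, can be sketched as follows. The article does not say which selection rule IBM uses, so this example assumes a simple majority vote: run every candidate, discard the ones that fail, and keep a query whose result agrees with the most other candidates. The hard-coded `candidates` list stands in for model output, and the demo table is hypothetical.

```python
import sqlite3
from collections import Counter


def build_demo_db():
    """Create a small in-memory database (hypothetical demo data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders (customer, total)
        VALUES ('ann', 10.0), ('bob', 25.0), ('ann', 5.0);
    """)
    return conn


def run(conn, sql):
    """Execute sql and return its rows as a frozenset, or None if it fails."""
    try:
        return frozenset(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def select_by_execution(conn, candidates):
    """Return the first candidate whose result is the most common one (assumed rule)."""
    results = {sql: run(conn, sql) for sql in candidates}
    valid = {sql: rows for sql, rows in results.items() if rows is not None}
    if not valid:
        return None
    winning_rows, _ = Counter(valid.values()).most_common(1)[0]
    return next(sql for sql in candidates if valid.get(sql) == winning_rows)


# Stand-ins for queries a model might generate for
# "What is the total order value per customer?"
candidates = [
    "SELECT customer, SUM(total) FROM orders GROUP BY customer",
    "SELECT customer, SUM(totl) FROM orders GROUP BY customer",  # typo: fails to run
    "SELECT o.customer, SUM(o.total) FROM orders AS o GROUP BY o.customer",
    "SELECT customer, MAX(total) FROM orders GROUP BY customer",  # runs, wrong answer
]

conn = build_demo_db()
best = select_by_execution(conn, candidates)
```

Here the query with the misspelled column is dropped because it raises an error, and the `MAX` variant loses the vote because two other candidates agree on a different result. Voting only helps when correct candidates outnumber any single wrong answer, so a production system would likely combine it with other signals.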
On the BIRD benchmark, IBM’s solution showed the top performance among AI systems in both accuracy and execution speed. It scored 80 on code execution speed, compared with 90 for human engineers and 65 for other AI systems. The extractive method for schema linking and the generative process for content linking were the crucial factors behind this result. Although the system answered only 68% of the questions correctly, compared with 93% for human engineers, its performance marks significant progress in automating SQL generation.
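An accuracy figure such as the 68% above can be understood as the share of questions for which the generated query returns the right data. The sketch below assumes a question counts as correct when the predicted query and a reference query return the same set of rows; the benchmark's official scoring may differ in detail, and the `staff` table and query pairs are invented for illustration.

```python
import sqlite3


def rows(conn, sql):
    """Return the query result as a set of rows, or None if the query fails."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Percentage of (predicted, reference) pairs whose result sets match."""
    correct = 0
    for predicted, reference in pairs:
        expected = rows(conn, reference)
        if expected is not None and rows(conn, predicted) == expected:
            correct += 1
    return 100.0 * correct / len(pairs)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO staff VALUES ('ann', 'ops', 50), ('bob', 'ops', 70), ('cy', 'hr', 60);
""")

# Hypothetical (predicted, reference) query pairs.
pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT name FROM staff WHERE salary > 55",
     "SELECT name FROM staff WHERE salary >= 60"),
    # Runs, but returns a different count: incorrect.
    ("SELECT COUNT(*) FROM staff",
     "SELECT COUNT(*) FROM staff WHERE dept = 'ops'"),
    # Misspelled column, fails to execute: incorrect.
    ("SELECT nme FROM staff", "SELECT name FROM staff"),
    # GROUP BY and DISTINCT return the same departments: correct.
    ("SELECT dept FROM staff GROUP BY dept", "SELECT DISTINCT dept FROM staff"),
]

score = execution_accuracy(conn, pairs)
```

Comparing results rather than SQL text means two differently written queries can both be counted as correct, which matters because a question rarely has a single valid SQL formulation.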
In conclusion, IBM’s work on using generative AI to simplify data querying for businesses is notable. Although its accuracy still trails that of human engineers, the new text-to-SQL generator offers a promising way to reduce the need for SQL proficiency in businesses and to give more people access to data insights.