The task of translating natural language questions into SQL (text-to-SQL) has historically been challenging due to the complexity of understanding user questions, database schemas, and SQL generation. Recent work has integrated Pre-trained Language Models (PLMs) into text-to-SQL systems with promising results. However, these models can still produce incorrect SQL as databases and the user queries posed against them grow more complex, which calls for more sophisticated and specialized optimization methods. More recently, Large Language Models (LLMs) have proven proficient at understanding natural language, particularly at larger scales, motivating their wider use in text-to-SQL research.
Implementing LLM-based text-to-SQL involves three key steps, made concrete in the sketch below. First, the system must understand the user's natural-language question and infer the query intent behind it. Second, it must comprehend the schema that describes the database's structure and identify the components relevant to the question. Finally, it must combine this understanding to generate a syntactically correct SQL query that returns the desired answer.
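To illustrate, here is a minimal zero-shot prompting sketch over a SQLite database. The schema serialization and prompt wording are illustrative conventions, and call_llm is a hypothetical wrapper for whatever LLM API is available, not part of any particular library.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM client call; replace the body
    with a request to whichever chat/completions API you use."""
    raise NotImplementedError("plug in an LLM client here")

def serialize_schema(conn: sqlite3.Connection) -> str:
    """Step 2: render the database schema as text the model can read."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(r[0] for r in rows)

def text_to_sql(question: str, conn: sqlite3.Connection) -> str:
    """Steps 1 and 3: pass the question plus schema to the model and
    treat its completion as the predicted SQL query."""
    prompt = (
        "Given the following SQLite schema:\n"
        f"{serialize_schema(conn)}\n\n"
        f"Write one SQL query that answers: {question}\n"
        "Return only the SQL."
    )
    return call_llm(prompt)
```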
Despite significant advancements, text-to-SQL continues to face several challenges, chiefly stemming from the ambiguity and complexity of natural language. Because terminology, schema structures, and question patterns vary widely, models often fail to generalize across domains. Others fail to accurately comprehend complicated database schemas, or produce incorrect SQL for queries involving complex or rare operations that are underrepresented in training data. Adapting these models to unseen domains remains difficult without domain-specific training data.
Over its evolution, text-to-SQL has progressed from rule-based to deep learning-based methodologies, and most recently to the integration of PLMs and LLMs. Rule-based systems converted user queries into SQL using hand-crafted rules and heuristics; while impressive for their time, these systems lacked generalizability and flexibility. Deep learning methodologies leveraged LSTM- and transformer-based neural networks to generate SQL queries from natural language, improving the ability to handle complex queries and generalize across domains. PLM-based methods offered better optimization for text-to-SQL by fine-tuning models like BERT and RoBERTa. LLMs such as the GPT series have shown considerable potential in SQL generation, particularly with prompt engineering and fine-tuning.
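As a rough illustration of the fine-tuning route, the sketch below adapts a small sequence-to-sequence PLM on (question, SQL) pairs with the HuggingFace transformers and datasets libraries. The model choice, toy data, and hyperparameters are illustrative stand-ins for a real setup trained on a corpus such as Spider.

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

MODEL = "t5-small"  # illustrative choice; any seq2seq PLM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Toy (question, SQL) pairs; real training uses a full text-to-SQL corpus.
pairs = [
    {"question": "How many singers are there?",
     "sql": "SELECT count(*) FROM singer"},
]

def encode(ex):
    # Encode the question as input and the gold SQL as target labels.
    enc = tok("translate to SQL: " + ex["question"],
              truncation=True, max_length=128)
    enc["labels"] = tok(ex["sql"], truncation=True, max_length=128)["input_ids"]
    return enc

train = Dataset.from_list(pairs).map(encode, remove_columns=["question", "sql"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t2s-ckpt",
                                  per_device_train_batch_size=1,
                                  num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```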
In evaluating text-to-SQL strategies, datasets are categorized as either “Original” or “Post-annotated,” and both types use cross-domain data to emulate real-world applications. There are also knowledge-augmented datasets (like BIRD and Spider-DK), context-dependent datasets (like SParC and CoSQL), robustness datasets (like Spider-Realistic and ADVETA), and cross-lingual datasets (like CSpider and DuSQL). Evaluation metrics for text-to-SQL fall into content-matching metrics, such as Component Matching (CM) and Exact Matching (EM), and execution-based metrics, which judge a predicted query by the results it returns. Research and development in LLM-based text-to-SQL continue to advance, offering promising potential for improving text-to-SQL efficiency and generalizability.
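For the execution-based side, a common criterion counts a prediction as correct when it returns the same rows as the gold query. Below is a minimal sketch of that check against a SQLite database; the order-insensitive multiset comparison is one common convention, and the function name is illustrative.

```python
import sqlite3
from collections import Counter

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Execution-based metric: the prediction counts as correct when it
    returns the same multiset of rows as the gold query (order ignored)."""
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # invalid or unexecutable SQL never matches
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)
```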