Large Language Models (LLMs) have enjoyed a surge in popularity thanks to their strong performance across a wide range of tasks. Recent research focuses on improving these models' accuracy using external resources, including both structured data and unstructured free text. However, many data sources, such as patient records or financial databases, contain a combination of both kinds of information. Previous chat systems have typically used classifiers to route queries to modules handling either structured or unstructured data, but this design falls short on questions that require both types of data at once.
In response, researchers at Stanford University have proposed an approach that equips conversational agents to work over hybrid data sources, combining structured data queries with free-text retrieval techniques. Their analysis found that over 49% of real-world queries require knowledge from both structured and unstructured sources. They propose SUQL (Structured and Unstructured Query Language), a language that extends SQL with the ability to process free text, combining off-the-shelf retrieval models and Large Language Models (LLMs) with standard SQL semantics and operators.
Inclusiveness, accuracy, and efficiency are the guiding principles of SUQL's design. SUQL extends SQL with free-text operators such as SUMMARY and ANSWER, which apply retrieval models and LLMs to text fields inside a query. Because natural-language questions are translated into SUQL, the system can handle complex queries that mix structured filters with conditions over free text. SUQL queries can run on conventional SQL engines, though a naive implementation would be inefficient, since the LLM-backed operators are expensive to evaluate row by row.
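To make the idea concrete, here is a minimal, hypothetical sketch of how an ANSWER-style free-text operator could combine with ordinary structured filtering. The table contents and the keyword-based `toy_answer` stub are invented stand-ins: in SUQL itself, the ANSWER operator is backed by retrieval models and LLM calls, not a keyword check.

```python
# Toy table mixing structured fields (price) with free text (reviews).
restaurants = [
    {"name": "Luigi's", "price": "moderate",
     "reviews": "Great pasta. Very kid friendly, with a children's menu."},
    {"name": "Le Fancy", "price": "expensive",
     "reviews": "Exquisite tasting menu. Quiet, romantic atmosphere."},
    {"name": "Taco Stop", "price": "cheap",
     "reviews": "Fast service and kid friendly seating outside."},
]

def toy_answer(text: str, question: str) -> str:
    """Stand-in for an LLM-backed ANSWER operator: answers a yes/no
    question about a free-text field. Here it is a crude keyword
    check purely for illustration."""
    if "family" in question.lower() or "kid" in question.lower():
        return "yes" if "kid friendly" in text.lower() else "no"
    return "no"

# Roughly analogous to a SUQL query such as:
#   SELECT name FROM restaurants
#   WHERE price <> 'expensive'
#     AND answer(reviews, 'Is it family friendly?') = 'yes';
results = [
    r["name"] for r in restaurants
    if r["price"] != "expensive"
    and toy_answer(r["reviews"], "Is it family friendly?") == "yes"
]
print(results)  # ["Luigi's", "Taco Stop"]
```

The row-by-row loop also hints at why a naive implementation is inefficient: each row would incur an LLM call, so an optimized compiler should evaluate cheap structured predicates first to shrink the set of rows the free-text operators ever touch.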
The SUQL method was tested in two experiments: one on HybridQA, a question-answering dataset, and the other on real restaurant data from Yelp.com. On HybridQA, the LLM-based SUQL approach achieved a 59.3% Exact Match (EM) and a 68.3% F1 score, coming within 8.9% EM and 7.1% F1 of the best existing models on the test set. In the real-world restaurant trials, SUQL achieved 93.8% turn accuracy on single-turn queries and 90.3% on conversational queries, gains of up to 36.8% and 26.9%, respectively, over linearization-based baselines.
In essence, the researchers from Stanford have introduced SUQL, a formal query language for hybrid databases that merges structured and unstructured data. SUQL's innovation lies in integrating free-text primitives into a compact, precise query framework. Using only in-context learning, it achieves results on HybridQA within 8.9% of state-of-the-art models that were trained on 62,000 samples. Unlike other methods, SUQL scales to large databases and large free-text collections. Experiments on Yelp data demonstrate SUQL's effectiveness, with a 90.3% success rate in satisfying user inquiries compared to a 63.4% baseline. The full research paper can be found on the Stanford website.