The integration of APIs into Large Language Models (LLMs) is a major step towards AI systems that can complete complex tasks, such as booking a hotel or submitting a job application, through conversational interfaces. However, these systems rely heavily on the LLM's ability to accurately identify the right APIs, fill in the necessary parameters, and sequence API calls based on the user's input. The main challenge is the lack of diverse, real-world training and benchmarking data, which is critical for models to generalize beyond their training domains.
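To make the task concrete, here is a minimal sketch of the kind of structured output a tool-augmented model must produce for a hotel-booking request. The call format, API names, and parameters below are illustrative assumptions, not API-BLEND's actual schema:

```python
# Hypothetical target output for the utterance:
#   "Book me a hotel in Paris for two nights starting March 3."
# The model must (1) identify the right APIs, (2) fill their parameters,
# and (3) order the calls correctly. All names here are illustrative.

api_sequence = [
    {
        "api": "SearchHotels",
        "parameters": {"city": "Paris", "check_in": "2024-03-03", "nights": 2},
    },
    {
        "api": "BookHotel",
        # Depends on the output of the previous call, so sequencing matters.
        "parameters": {"hotel_id": "<result of SearchHotels>", "nights": 2},
    },
]

def validate(sequence):
    """Check that every call names an API and fills at least one parameter."""
    return all("api" in call and call.get("parameters") for call in sequence)

print(validate(api_sequence))  # True
```

Getting any one of the three sub-tasks wrong (a misnamed API, a missing slot, or calls in the wrong order) breaks the end-to-end task, which is why all three are evaluated.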
A new dataset called API-BLEND has been introduced to address this problem. API-BLEND combines human-annotated data with LLM-assisted generation, covering over 178,000 instances across its training, development, and test splits. The dataset is notable for its focus on sequencing tasks, which existing datasets have largely neglected, and it spans a variety of API-related tasks drawn from domains such as semantic parsing, dialogue, and digital assistance.
API-BLEND owes its effectiveness to a comprehensive approach to data collection: language-model-assisted generation, grammar-based generation, and direct inclusion of off-the-shelf datasets. This mix ensures a rich variety of API sequences, parameters, and contexts, capturing the complexity of real-world API usage. The dataset includes sequences derived from existing dialogues and converted into API calls by models such as FLAN-T5-XXL, enriched by grammar rule-based transformations and by pre-existing datasets adapted for API-sequence evaluation.
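The grammar-based route can be thought of as a deterministic rewrite of an annotated dialogue state into an API-call string. The sketch below assumes a simple `Intent(key="value", ...)` template and invented slot names; API-BLEND's actual grammar rules are more elaborate:

```python
# A minimal sketch of grammar-rule-based generation: deterministically
# rendering an annotated intent + slot-value state as an API-call string.
# The template, intent name, and slots are illustrative assumptions.

def state_to_api_call(intent: str, slots: dict) -> str:
    """Render one annotated turn as 'Intent(key="value", ...)'."""
    # Sort slots so the same annotation always yields the same string.
    args = ", ".join(f'{key}="{value}"' for key, value in sorted(slots.items()))
    return f"{intent}({args})"

turn = {
    "intent": "FindRestaurant",
    "slots": {"food": "italian", "area": "centre"},
}

call = state_to_api_call(turn["intent"], turn["slots"])
print(call)  # FindRestaurant(area="centre", food="italian")
```

Because the rewrite is rule-driven rather than model-driven, the resulting API calls are guaranteed to be well-formed, which complements the noisier but more varied LLM-assisted conversions.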
Empirical evaluations show that API-BLEND surpasses other datasets as both a training and a benchmarking tool: models fine-tuned on API-BLEND achieve higher out-of-domain precision, reflecting an improved ability to handle the complexities of API integration.
Furthermore, API-BLEND has been rigorously benchmarked against nine open-source models across a variety of settings, including few-shot testing, instruction fine-tuning on target datasets, and combined-dataset fine-tuning. The results underscore API-BLEND's effectiveness in training models that excel at API detection, parameter filling, and sequencing, all of which are essential for executing complex tasks through conversational AI.
To conclude, API-BLEND is a significant resource for advancing and benchmarking tool-augmented LLMs, bridging the gap between the limitations of synthetic data and the need for real-world applicability. By providing a diverse and comprehensive dataset, API-BLEND advances the development of API-integrated language models and sets a new standard for dataset variety and utility. An exciting future direction is to explore environment interactions and multilingual API commands, amplifying the practical impact and reach of API-augmented AI systems.