Recent advances in large language models (LLMs) have expanded their utility, enabling them to complete a broader range of tasks. However, challenges such as the complexity and non-deterministic nature of these models, together with their tendency to waste computational resources on redundant calculations across repeated calls, limit their effectiveness.
To tackle these issues, researchers from Stanford, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University have introduced SGLang, a Structured Generation Language designed to optimize the performance of LLMs. The language systematically exploits the multi-call structure of LLM programs to speed up their execution. Comprising a front-end language and a back-end runtime, SGLang simplifies the writing of LLM programs and accelerates their execution.
SGLang’s front end is designed for ease of use. It gives developers primitives to control parallelism and generation, and it integrates with Python libraries and control flow, so users can build intricate prompting procedures with an intuitive syntax.
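As a rough illustration, here is the kind of program the front end is meant to express, written against the `sgl.function`, `sgl.gen`, and `fork` primitives documented by the SGLang project; the prompt text, parameter values, and local endpoint are assumptions made for this sketch rather than details taken from the announcement.

```python
import sglang as sgl

# Assumption: a local SGLang runtime is already serving a model at this address.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def tip_suggestion(s):
    s += "Here are two tips for staying healthy: 1. Balanced Diet. 2. Regular Exercise.\n"

    # fork() runs the two expansions in parallel instead of one after the other.
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i + 1} into a paragraph:\n"
        f += sgl.gen("detailed_tip", max_tokens=128, stop="\n\n")

    # Join the parallel branches back into the main prompt state.
    s += "Tip 1: " + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2: " + forks[1]["detailed_tip"] + "\n"
    s += "In summary: " + sgl.gen("summary", max_tokens=64)

state = tip_suggestion.run()
print(state["summary"])
```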
The research team also built an interpreter and a compiler for SGLang. The interpreter manages each prompt as an asynchronous stream, submitting primitive operations to it for asynchronous execution. The compiler, for its part, traces an SGLang program and compiles it for further optimization.
SGLang speeds up program execution primarily through two techniques: RadixAttention and a compressed finite state machine. RadixAttention enables automatic KV cache reuse across multiple generation calls, addressing the way current inference engines simply discard the KV cache after each call. The compressed finite state machine, in turn, enables faster constrained decoding by collapsing several token-by-token transition paths into a single, shorter one.
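To make the RadixAttention idea concrete, the sketch below models the KV cache as a toy prefix tree keyed by token IDs: a new call that shares a prefix with an earlier call (for example, the same system prompt or few-shot examples) can reuse the cached entries for the matched tokens. This is a conceptual simplification, not the actual implementation; the real runtime manages GPU memory for the KV cache and evicts tree nodes under memory pressure.

```python
# Toy sketch of the prefix-sharing idea behind RadixAttention.
# Each node stands in for cached KV entries of one token; the token IDs and
# requests below are made up for illustration.

class Node:
    def __init__(self):
        self.children = {}  # token id -> Node


class PrefixTree:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a finished request's tokens so later calls can reuse them."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())


tree = PrefixTree()
shared_prompt = [101, 7592, 2088, 102]      # e.g. a shared system prompt
tree.insert(shared_prompt + [2054, 2003])   # first generation call

new_request = shared_prompt + [2129, 2079]  # second call shares the prefix
reused = tree.match_prefix(new_request)
print(f"KV cache reusable for {reused} of {len(new_request)} tokens")
```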
SGLang can also optimize multi-call programs that run on API-only models such as GPT-4. It has already been used to build a range of LLM applications, including agent control, reasoning, retrieval-augmented pipelines, multi-turn chat, multi-modality processing, and few-shot learning benchmarks.
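The same front-end program can be pointed at an API-backed model simply by swapping the backend. The snippet below illustrates this with the OpenAI backend wrapper shipped with the SGLang front end; the model name, prompts, and token limits are assumptions made for this example.

```python
import sglang as sgl

# Assumption: an OpenAI API key is configured in the environment; the model
# name is illustrative.
sgl.set_default_backend(sgl.OpenAI("gpt-4"))

@sgl.function
def multi_turn(s, question_1, question_2):
    s += sgl.system("You are a concise assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=64))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=64))

state = multi_turn.run(
    question_1="What is a KV cache?",
    question_2="Why does reusing it across calls help?",
)
print(state["answer_2"])
```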
The implementation of SGLang was evaluated on a range of models and GPUs, where it showed markedly better performance than other programming and inference systems. The researchers, who have released the code on GitHub, nonetheless acknowledge areas for further work: supporting more output modalities, adapting RadixAttention to multiple levels of the memory hierarchy and improving memory scheduling, enhancing the SGLang compiler, and integrating higher-level primitives into the language. Despite these limitations, SGLang represents a significant advance in the effort to improve the efficiency and usability of LLMs.