The growing use of large language models (LLMs) such as GPT-3, OPT, and BLOOM in user-facing services has underscored the need to optimize the infrastructure they run on. LLMs are notoriously large and computationally demanding, which makes them difficult to deploy and operate efficiently.
Researchers from Microsoft Research, ETH Zurich, and Carnegie Mellon University have created a new system named DéjàVu to tackle these issues. At its core is DéjàVuLib, a Key-Value (KV) cache streaming library designed to make LLM serving more efficient. What distinguishes DéjàVu is how it addresses the mismatch between prompt processing and token generation: the two phases have very different latency and resource profiles, and interleaving them on the same hardware often leaves GPUs under-utilized.
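To make the streaming idea concrete, here is a minimal Python sketch of a per-layer KV cache being transferred in token-sized chunks. The names here (KVCache, stream_kv) and the shapes are illustrative stand-ins, not DéjàVuLib's actual API; the point is that a consumer can begin using early chunks while later ones are still in flight.

```python
import numpy as np

class KVCache:
    """Per-layer key/value tensors for one request; shapes are illustrative."""
    def __init__(self, n_layers, n_tokens, n_heads, head_dim):
        self.keys = [np.zeros((n_tokens, n_heads, head_dim), dtype=np.float16)
                     for _ in range(n_layers)]
        self.values = [np.zeros((n_tokens, n_heads, head_dim), dtype=np.float16)
                       for _ in range(n_layers)]

def stream_kv(cache, chunk_tokens=64):
    """Yield the cache layer by layer in token chunks, so the receiver can
    start consuming state before the full cache has been transferred."""
    for layer, (k, v) in enumerate(zip(cache.keys, cache.values)):
        for start in range(0, k.shape[0], chunk_tokens):
            yield layer, start, k[start:start+chunk_tokens], v[start:start+chunk_tokens]

cache = KVCache(n_layers=4, n_tokens=256, n_heads=8, head_dim=64)
for layer, offset, k_chunk, v_chunk in stream_kv(cache):
    pass  # e.g. copy each chunk to another device or to a host buffer
```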
DéjàVu changes the game by disaggregating prompt processing from token generation, assigning distinct computational resources to each phase. This design accounts for their differing demands: prompt processing is compute-intensive and bursty, while token generation is steadier and bound largely by memory capacity and bandwidth. By serving each phase on hardware sized for it, DéjàVu keeps GPUs effectively utilized across both.
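The sketch below illustrates this disaggregated pipeline under simplifying assumptions: the "prompt worker" and "token worker" are plain Python threads, and an in-process queue stands in for the KV cache stream that would connect prompt GPUs to token GPUs in a real deployment.

```python
import queue
import threading

kv_stream = queue.Queue()  # stands in for the GPU-to-GPU KV cache stream

def prompt_worker(prompts):
    """Compute-heavy phase: process each full prompt once, emit its KV cache."""
    for req_id, prompt in prompts:
        kv_cache = f"kv({prompt})"  # placeholder for real attention state
        kv_stream.put((req_id, kv_cache))
    kv_stream.put(None)             # signal end of stream

def token_worker():
    """Steady, memory-bound phase: consume KV caches and generate tokens."""
    while (item := kv_stream.get()) is not None:
        req_id, kv_cache = item
        print(f"request {req_id}: generating tokens from {kv_cache}")

t = threading.Thread(target=prompt_worker, args=([(0, "hello"), (1, "world")],))
t.start()
token_worker()
t.join()
```

Because the two workers never contend for the same device, a slow or bursty prompt never stalls ongoing token generation, which is the utilization gap the disaggregation is meant to close.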
One key feature of DéjàVu is micro-batch swapping, which stretches GPU memory by dynamically moving micro-batches between GPU and CPU memory. This permits larger batch sizes without a proportional increase in GPU memory, boosting throughput and allowing larger models to be served within a fixed hardware budget.
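Here is a toy version of such a swapping policy, again with hypothetical names: dictionaries stand in for GPU and CPU memory, and a fixed slot count models the GPU's capacity, so four micro-batches can make progress even though only two fit on the device at once.

```python
gpu_slots = {}    # micro-batch KV caches resident in (simulated) GPU memory
cpu_store = {}    # micro-batch KV caches offloaded to (simulated) host memory
GPU_CAPACITY = 2  # how many micro-batch caches fit on the GPU at once

def ensure_resident(mb_id):
    """Swap a micro-batch's KV cache onto the GPU, offloading the oldest
    resident micro-batch to CPU memory if the GPU is full."""
    if mb_id in gpu_slots:
        return
    if len(gpu_slots) >= GPU_CAPACITY:
        victim = next(iter(gpu_slots))  # oldest-loaded micro-batch
        cpu_store[victim] = gpu_slots.pop(victim)
    # Pull the cache back from CPU memory, or create it on first use.
    gpu_slots[mb_id] = cpu_store.pop(mb_id, f"kv-state-{mb_id}")

# Round-robin generation steps over 4 micro-batches on a 2-slot GPU:
for step in range(8):
    mb = step % 4
    ensure_resident(mb)
    # ... run one generation step for micro-batch `mb` here ...
    print(f"step {step}: micro-batch {mb} on GPU, resident={sorted(gpu_slots)}")
```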
DéjàVu also improves resilience through state replication. By duplicating the KV cache across separate memory stores, it can recover quickly from a failure, resuming from the last replicated state rather than recomputing from scratch, which limits the impact on overall performance.
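The recovery path can be sketched in a few lines. In this simplified picture a dictionary plays the role of the replicated store, and deep copies stand in for snapshotting the KV cache and generation state after each step.

```python
import copy

replica = {}  # stands in for a KV cache copy held on a peer or in host memory

def generation_step(state):
    """Produce one token, then replicate the updated state."""
    state["tokens"].append(f"tok{len(state['tokens'])}")
    replica[state["req_id"]] = copy.deepcopy(state)

def recover(req_id):
    """After a failure, resume from the last replicated state instead of
    recomputing the prompt and every generated token from scratch."""
    return copy.deepcopy(replica[req_id])

state = {"req_id": 0, "tokens": []}
for _ in range(3):
    generation_step(state)

state = None            # simulate losing the worker's in-memory state
state = recover(0)      # resume from the replica
generation_step(state)  # generation continues from token 3, not token 0
print(state["tokens"])  # ['tok0', 'tok1', 'tok2', 'tok3']
```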
In the authors' evaluation, DéjàVu achieves up to twice the throughput of existing serving systems, which translates into shorter waiting times for users and greater confidence in LLM-powered services.
DéjàVu's modular architecture lets it adapt to the evolving needs of LLM applications. Combined with its gains in efficiency and reliability, it represents a significant step toward making LLMs practical in everyday applications.
In sum, DéjàVu improves both the efficiency and the fault tolerance of LLM serving, substantially outperforming existing systems. Separating prompt processing from token generation, together with micro-batch swapping, optimizes GPU utilization and memory management, while fast recovery keeps service interruptions to a minimum. Taken together, these features make DéjàVu a promising foundation for better user experiences across LLM-powered services.