
A Concurrent Programming Framework for Analyzing Efficiency Challenges in Serving Multiple Long-Context Requests under Limited GPU High-Bandwidth Memory (HBM)

Large language models (LLMs) are becoming progressively more powerful, with recent models exhibiting GPT-4 level performance. Nevertheless, using these models for applications that require extensive context, such as understanding long videos or repository-scale coding, presents significant hurdles. These tasks typically require input contexts of 100K to 10M tokens, a great leap from the standard 4K token limit. Researchers are therefore investigating how to deploy production-level transformers with 1M-token contexts as cost-effectively as their 4K counterparts.

The main obstacle in deploying long-context transformers is the sheer size of the Key-Value (KV) cache required. For example, a 30+B-parameter model with a 100K context demands an enormous 22.8GB of KV cache, in stark contrast to the mere 0.91GB needed for a 4K context, illustrating the drastic rise in memory requirements as context length grows.
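To see where numbers of this magnitude come from, the sketch below computes the KV cache footprint directly from the model architecture. The configuration used here (60 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16 storage) is an assumed stand-in for a typical 30+B model, not a figure taken from the paper, but it reproduces roughly the same sizes.

```python
# Back-of-the-envelope KV cache sizing for a hypothetical 30+B model.
# Assumed configuration (not from the paper): 60 layers, 8 KV heads
# (grouped-query attention), head dimension 128, fp16 (2 bytes per element).

def kv_cache_gib(context_len: int,
                 n_layers: int = 60,
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: two tensors (K and V) per layer, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len
    return total_bytes / 1024**3

print(f"4K context:   {kv_cache_gib(4_000):.2f} GiB")    # ~0.92 GiB
print(f"100K context: {kv_cache_gib(100_000):.2f} GiB")  # ~22.9 GiB
```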

To address these difficulties, researchers at the University of Edinburgh have created a concurrent programming framework for analyzing efficiency issues when serving multiple long-context requests under limited GPU high-bandwidth memory (HBM). Using a 34B GPT-3.5 level model with a 50K context on an A100 NVLink GPU as a running example, they highlight four key deployment challenges caused by the large KV cache and propose a comprehensive framework for developing solutions to them.
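To make the setting concrete, here is a minimal sketch of how such a serving scenario can be modeled as a concurrent program: each request holds a KV cache that must fit within a fixed HBM budget, so requests queue whenever memory runs out. This is not the authors' code; the HBM budget, per-request cache size, and timing below are illustrative assumptions (the 11.4 GiB figure corresponds to a 50K context under the architecture assumed in the previous sketch).

```python
import asyncio

# Illustrative HBM left for KV cache after model weights, in GiB (assumed value).
HBM_KV_BUDGET_GIB = 30.0

class KVCachePool:
    """Toy model of KV cache memory: requests wait until enough HBM is free."""
    def __init__(self, budget_gib: float):
        self.free_gib = budget_gib
        self._cond = asyncio.Condition()

    async def acquire(self, size_gib: float):
        async with self._cond:
            await self._cond.wait_for(lambda: self.free_gib >= size_gib)
            self.free_gib -= size_gib

    async def release(self, size_gib: float):
        async with self._cond:
            self.free_gib += size_gib
            self._cond.notify_all()

async def serve_request(req_id: int, kv_gib: float, pool: KVCachePool):
    await pool.acquire(kv_gib)           # blocks if HBM is full -> queueing delay
    try:
        await asyncio.sleep(0.1)         # stand-in for prefill + decode time
        print(f"request {req_id} served ({kv_gib:.1f} GiB KV cache)")
    finally:
        await pool.release(kv_gib)       # free (or offload) the KV cache

async def main():
    pool = KVCachePool(HBM_KV_BUDGET_GIB)
    # Five concurrent 50K-context requests at ~11.4 GiB each cannot all fit at
    # once, so only two run concurrently and the rest wait for HBM to free up.
    await asyncio.gather(*(serve_request(i, 11.4, pool) for i in range(5)))

asyncio.run(main())
```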

Approaches for compressing the KV cache along four dimensions (layer, head, token, and hidden) are discussed extensively in this research. The layer dimension can potentially be compressed by skipping layers during prefilling, which substantially reduces the KV cache. Significant compression along the head dimension is achievable by identifying and preserving only the most critical heads. Compressing the token and hidden dimensions can further improve storage efficiency.
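As a schematic view of these four axes, the KV cache can be treated as a tensor indexed by (layer, head, token, hidden); the sketch below shows how shrinking each axis reduces the stored size. The compression choices here (halving layers, halving heads, a 4K-token window, 8-bit channels) are illustrative placeholders, not methods or results from the paper.

```python
from dataclasses import dataclass, replace

@dataclass
class KVShape:
    """KV cache viewed as a tensor over (layer, head, token, hidden), for K and V."""
    n_layer: int = 60         # assumed depth of a ~34B model
    n_head: int = 8           # assumed KV heads (grouped-query attention)
    n_token: int = 50_000     # context length
    d_head: int = 128         # per-head hidden size
    bytes_per_elem: int = 2   # fp16 storage

    def gib(self) -> float:
        elems = 2 * self.n_layer * self.n_head * self.n_token * self.d_head
        return elems * self.bytes_per_elem / 1024**3

full = KVShape()
compressed = {
    "layer  (cache only half the layers)":     replace(full, n_layer=full.n_layer // 2),
    "head   (keep the most critical half)":    replace(full, n_head=full.n_head // 2),
    "token  (evict all but a 4K window)":      replace(full, n_token=4_096),
    "hidden (8-bit instead of fp16 channels)": replace(full, bytes_per_elem=1),
}

print(f"full KV cache: {full.gib():.2f} GiB")
for name, shape in compressed.items():
    print(f"{name}: {shape.gib():.2f} GiB")
```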

The current research provides a thorough analysis of the issues in deploying long-context transformers, with the goal of achieving cost-effectiveness equivalent to that of 4K models. The endeavor aims both to democratize advanced AI applications and to establish a concurrent programming framework that can efficiently serve long-context language models. Taken together, these analyses and optimizations mark a major step towards end-to-end system optimization for deploying long-context transformers.
