A team of researchers from MIT and other institutions has identified a surprising cause of performance degradation in chatbots and devised a simple fix that enables persistent, uninterrupted dialogue. The problem arises when a human-AI interaction involves many continuous rounds of conversation, which can overwhelm the large language models that power chatbots like ChatGPT.
The researchers addressed the issue by tweaking the key-value cache, which acts as the conversation memory at the core of many large language models. In some methods, when this cache must hold more information than it has capacity for, the earliest pieces of data are bumped out, which can cause the model to fail. By ensuring that these first few data points stay in memory, the researchers devised a way for a chatbot to sustain a conversation no matter how long it runs.
The new method, called StreamingLLM, enables a model to remain efficient even when a conversation stretches past 4 million words. Compared with a widely used method that avoids crashing by constantly recomputing part of the past conversation, StreamingLLM performed more than 22 times faster, letting a chatbot hold long dialogues without needing to be continually rebooted. This efficiency could make AI assistants more effective at tasks such as copywriting, editing, or generating code.
Large language models encode data, such as the words a user types, into representations called tokens. An attention mechanism uses these tokens to generate new text, storing their intermediate representations in a memory called the KV cache. When this cache grows very large, computation slows down, and performance collapses if generating content requires more tokens than the cache can hold.
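To make this concrete, here is a minimal sketch, in plain Python with NumPy, of a single-head KV cache. It is not the authors' code, and the head dimension and random vectors are purely illustrative; the point is that every decoded token appends one key and one value vector, and each new query attends over everything stored so far, so memory use and per-step cost grow with the length of the conversation.

```python
# Toy illustration of why a KV cache grows with the conversation:
# each generated token appends one key and one value vector, and the
# new token's query attends over everything cached so far.
import numpy as np

d = 64                           # hypothetical head dimension
rng = np.random.default_rng(0)

keys, values = [], []            # the "KV cache" for one attention head

def decode_step(query):
    """Attend the new token's query over all cached keys/values."""
    K = np.stack(keys)                       # (cache_len, d)
    V = np.stack(values)                     # (cache_len, d)
    scores = K @ query / np.sqrt(d)          # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax: weights sum to 1
    return weights @ V                       # attention output

for step in range(1000):                     # a long conversation...
    k, v, q = rng.standard_normal((3, d))
    keys.append(k)                           # the cache keeps growing, so
    values.append(v)                         # memory and per-step cost rise
    out = decode_step(q)                     # with every new token
```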
To tackle these problems, researchers conventionally use a sliding cache that bumps out the oldest tokens to make room for new ones. But the model's performance often plummets as soon as that first token is evicted, rapidly degrading the quality of newly generated words.
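The conventional sliding cache can be pictured as a fixed-size queue, as in the toy sketch below (the capacity and token stream are made up for illustration): once the cache is full, every new token pushes out the oldest one, so the very first token is the first thing to disappear.

```python
# Sketch of a plain sliding-window cache: when full, the oldest entry is
# dropped to make room, so the first token is the first to be evicted.
from collections import deque

CACHE_SIZE = 1024                    # illustrative capacity, in tokens

cache = deque(maxlen=CACHE_SIZE)     # a deque drops the oldest item when full

for token_id in range(5000):         # stream of incoming tokens
    cache.append(token_id)           # token 0 is gone after step 1024

print(cache[0])                      # -> 3976: the earliest tokens were evicted
```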
In the new study, the researchers found that keeping the first token in the sliding cache allows the model to maintain its performance even when the cache size is exceeded. Digging into this unexpected result, they uncovered the phenomenon behind it: an “attention sink,” their name for the initial token, onto which the model offloads any attention score it does not need to place elsewhere.
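The mechanics of the sink can be seen in a toy softmax calculation (the numbers below are illustrative only): attention weights always sum to one, so attention the model does not need elsewhere has to land somewhere, and trained models park it on the first token. Evicting that token forces the leftover mass to be redistributed across tokens that were never meant to receive it.

```python
# Softmax attention weights always sum to 1, so "unneeded" attention mass
# must land somewhere; models learn to park it on the first token (the sink).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([4.0, -2.0, -2.1, -1.9, -2.0])   # toy scores: token 0 is the sink

w = softmax(scores)
print(w.round(3))          # most of the mass sits on token 0
print(w.sum())             # 1.0 -- the mass has to go somewhere

w_evicted = softmax(scores[1:])    # drop the first token from the cache
print(w_evicted.round(3))          # the excess mass is redistributed,
                                   # distorting attention for later tokens
```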
In developing StreamingLLM, the team found that performance was best when four attention sink tokens sat at the start of the sliding cache. They also found that each cached token’s positional encoding must be kept consistent, assigned according to its position within the cache rather than in the original text, even as new tokens are added and old ones are evicted. By combining these two insights, the researchers enabled StreamingLLM to maintain an uninterrupted conversation while outperforming a commonly used method that relies on recomputation.
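Putting the two findings together, the cache policy can be sketched roughly as follows. This is a toy token cache in plain Python, not the released StreamingLLM implementation; the class name and window size are illustrative, while the count of four sink tokens follows the article. The four sink tokens are never evicted, a rolling window holds the most recent tokens, and position IDs follow the order inside the cache.

```python
# Minimal sketch of a sink-plus-sliding-window cache policy:
# keep the first four "attention sink" tokens forever, keep a rolling
# window of recent tokens, and assign positions by place in the cache.
NUM_SINKS = 4      # attention-sink tokens kept at the front (per the article)
WINDOW = 8         # illustrative size of the rolling window of recent tokens

class SinkCache:
    def __init__(self):
        self.tokens = []                       # stands in for cached keys/values

    def append(self, token):
        self.tokens.append(token)
        if len(self.tokens) > NUM_SINKS + WINDOW:
            # Evict the oldest non-sink token; the sinks are never evicted.
            del self.tokens[NUM_SINKS]

    def position_ids(self):
        # Positions follow the order inside the cache, so they stay
        # contiguous no matter which tokens have been evicted.
        return list(range(len(self.tokens)))

cache = SinkCache()
for t in range(20):                            # stream 20 token IDs
    cache.append(t)

print(cache.tokens)          # [0, 1, 2, 3, 12, 13, ..., 19]
print(cache.position_ids())  # [0, 1, 2, ..., 11]
```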
Notably, StreamingLLM has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. At present, the model cannot remember words that are no longer stored in the cache; future work will address this limitation by investigating ways to retrieve evicted tokens or to enable the model to memorize previous conversations.