Researchers from MIT and elsewhere have developed a fix for a problem in which a chatbot’s performance deteriorates over a long, continuous dialogue with a human, a problem they trace to how large language models manage their memory cache. Their method, called StreamingLLM, retains a few key data points in the cache, which lets a chatbot continue a conversation of any length without a drop in performance.
Large language models encode the words in a user query into representations called tokens, and an attention mechanism uses these tokens to generate new text. As a dialogue grows longer, more tokens are generated and stored in a memory cache known as the key-value (KV) cache. A larger cache slows computation, and the cache also has a finite capacity. Once that capacity is exceeded, the oldest tokens are evicted to make room for new ones, and the model’s performance drops abruptly as soon as the first token is removed.
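The eviction behaviour described above can be illustrated with a minimal sketch. It assumes a cache that tracks only token ids (a real KV cache holds key and value tensors for every layer), and the function name `sliding_cache` is hypothetical, chosen for illustration.

```python
from collections import deque

def sliding_cache(token_ids, capacity):
    """Toy sliding cache: keep only the most recent `capacity` token ids."""
    cache = deque(maxlen=capacity)  # a full deque drops its oldest entry on append
    for tok in token_ids:
        cache.append(tok)
    return list(cache)

# After ten tokens with room for four, the first token (0) is long gone.
print(sliding_cache(range(10), capacity=4))  # [6, 7, 8, 9]
```

The point of the sketch is only that the very first token is the first one lost, which is the moment at which performance is reported to collapse.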
To counter this, the researchers proposed keeping the first few tokens in the cache even when space runs out. Although this seems counterintuitive, they found that models tend to treat the first token as an “attention sink”. Because the softmax operation used in attention forces the scores to sum to 1, and most tokens are only weakly related to one another, the model places the leftover attention on the first token. The first token must therefore stay in the cache to preserve the model’s behaviour.
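A small sketch may help show why leftover attention has to go somewhere. It implements a standard softmax and applies it to hypothetical raw scores invented for illustration; it shows only the sum-to-1 constraint, not how a trained model comes to favour the first token.

```python
import math

def softmax(scores):
    """Turn raw attention scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores: a high score on the first token, low scores elsewhere.
weights = softmax([4.0, 0.1, 0.2, 0.0, 0.1])
print(round(sum(weights), 6))  # 1.0
print(weights[0] > 0.5)        # True: most of the weight sits on the first token
```

If the entry that carries most of the weight is removed, the remaining weights must still sum to 1, so they are redistributed over tokens that were never meant to receive them.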
Building on this insight, the research team designed StreamingLLM and found that keeping four attention sink tokens at the start of the cache gives the best performance. They also found that each token’s positional encoding in the KV cache must be maintained even as other tokens are evicted.
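The cache policy described here can be sketched as follows. This is a toy illustration under stated assumptions, not the researchers’ implementation: it tracks token ids only, it leaves out positional encoding entirely, and the names `streaming_cache` and `num_sinks` are hypothetical.

```python
def streaming_cache(token_ids, capacity, num_sinks=4):
    """Toy policy: pin the first `num_sinks` tokens, slide a window over the rest."""
    if capacity <= num_sinks:
        raise ValueError("capacity must exceed num_sinks")
    sinks, recent = [], []
    for tok in token_ids:
        if len(sinks) < num_sinks:
            sinks.append(tok)      # the first tokens are never evicted
        else:
            recent.append(tok)
            if len(recent) > capacity - num_sinks:
                recent.pop(0)      # evict the oldest non-sink token
    return sinks + recent

# Twelve tokens, room for eight: the four sinks stay, plus the four newest tokens.
print(streaming_cache(range(12), capacity=8))  # [0, 1, 2, 3, 8, 9, 10, 11]
```

The cache never grows beyond `capacity`, so the cost of decoding each new token stays bounded however long the conversation runs.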
Compared with a popular method that relies on recomputation, StreamingLLM was markedly faster. With a cache of 256 tokens, the recomputation method took roughly twice as long as StreamingLLM to decode a new token. With a cache of 4,096 tokens, the gap in decoding speed is starker still.
Other researchers in the field consider StreamingLLM promising because of its implications for a wide range of AI applications. Notably, it has already been used successfully to deploy the Mistral conversational AI model on iPhones. The method’s limitations, including its inability to recall evicted tokens or past conversations, will be the focus of future research. NVIDIA has incorporated StreamingLLM into TensorRT-LLM, its large language model optimisation library, underscoring the method’s potential. The project was funded by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.