Researchers from MIT and other institutions have proposed a solution to the problem of AI systems losing the thread of a conversation during extended dialogues. The large language models that power chatbots such as ChatGPT often struggle to retain information from long conversations, and their performance can deteriorate rapidly as a result.
The team has developed a method called StreamingLLM, which adjusts the key-value cache, the model's conversation memory, so that the first few tokens always remain stored, even in very long conversations. This modification lets a chatbot keep the conversation going without crashing or slowing down, even when the dialogue stretches past 4 million words. The method proved more than 22 times faster than a commonly used alternative that avoids crashes by repeatedly recomputing part of the past conversation, pointing to applications in copywriting, editing, and code generation.
Large language models encode data into representations called tokens, whose derived key and value vectors are stored in a key-value cache. From these cached entries the model builds an attention map, which records how strongly each token relates to every other token and is crucial for producing human-like conversational text. In extended dialogues, however, the cache can grow beyond its capacity, slowing computation and dragging down the model's performance.
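To make that bookkeeping concrete, here is a minimal Python sketch of a key-value cache and the attention weights computed from it. The class, the head dimension, and the random vectors are illustrative assumptions for exposition, not the researchers' implementation.

```python
import numpy as np

HEAD_DIM = 64  # illustrative head dimension, not a value from the paper


class KVCache:
    """Toy key-value cache: one key vector and one value vector per decoded token."""

    def __init__(self):
        self.keys = []    # each entry: array of shape (HEAD_DIM,)
        self.values = []  # each entry: array of shape (HEAD_DIM,)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def attend(self, query):
        """Attend a new query over every cached entry.

        The softmax weights form one row of the attention map; the cost of
        this step grows with the number of cached tokens.
        """
        K = np.stack(self.keys)                    # (num_cached, HEAD_DIM)
        V = np.stack(self.values)                  # (num_cached, HEAD_DIM)
        scores = K @ query / np.sqrt(HEAD_DIM)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # attention weights sum to 1
        return weights @ V, weights


# Pretend five tokens have already been decoded, then attend a sixth.
cache = KVCache()
for _ in range(5):
    cache.append(np.random.randn(HEAD_DIM), np.random.randn(HEAD_DIM))
output, attention_row = cache.attend(np.random.randn(HEAD_DIM))
```

The point to notice is that every new token attends over the whole cache, so both memory and per-token compute grow as the conversation gets longer.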
To address this, a “sliding cache” is used that evicts the oldest tokens to make room for new ones. However, evicting the very first token often causes a severe drop in the model’s performance. The researchers found that by keeping that first token in the sliding cache, the model maintains its performance even when the cache capacity is exceeded. They call this token an “attention sink”: because the model’s attention weights must always sum to one, it needs somewhere to deposit leftover attention, and it dumps that excess on the first token.
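That eviction rule can be sketched as follows, building on the toy cache above; the capacity, sink count, and function name are assumptions chosen for illustration rather than details taken from the released code.

```python
def evict(cached_positions, capacity, num_sinks=1):
    """Sliding-window eviction that always preserves the attention sink(s).

    `cached_positions` lists the original token indices currently in the cache.
    The first `num_sinks` entries are kept no matter what; the rest behave as a
    sliding window, with the oldest entries dropped until the cache fits.
    """
    if len(cached_positions) <= capacity:
        return cached_positions
    sinks = cached_positions[:num_sinks]             # always retained
    window = cached_positions[num_sinks:]
    return sinks + window[-(capacity - num_sinks):]  # keep only the most recent tokens


# A 6-slot cache over tokens 0..9, keeping token 0 as the attention sink.
print(evict(list(range(10)), capacity=6, num_sinks=1))  # -> [0, 5, 6, 7, 8, 9]
```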
Refining the idea further, the team found that placing four attention sink tokens at the start of the cache leads to optimal performance. They also changed how positional encodings are handled: each token’s position is assigned according to its place within the cache rather than its original position in the conversation, so the positions stay consecutive even as older tokens are evicted.
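The cache-relative positions can be illustrated with another small sketch; the example indices mirror the idea described above, but the helper itself is hypothetical.

```python
def cache_relative_positions(cached_token_indices):
    """Assign positions by location inside the cache, not in the original text.

    With four sinks and a sliding window, the cached original indices might be
    [0, 1, 2, 3, 6, 7, 8]; the positions the model sees are then simply 0..6,
    so the sequence of positions stays dense no matter which tokens were evicted.
    """
    return list(range(len(cached_token_indices)))


print(cache_relative_positions([0, 1, 2, 3, 6, 7, 8]))  # -> [0, 1, 2, 3, 4, 5, 6]
```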
Evaluations showed that with 256 tokens in the cache, the recomputation method took 63 milliseconds to decode a new token, while StreamingLLM did the same in roughly half the time. The gap widened as the cache size increased.
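The shape of that result, though not the reported numbers, can be reproduced qualitatively with a toy comparison: re-projecting every token in the window at each step costs time proportional to the window size, while reusing a fixed cache does not. Everything below, from the single-layer stand-in for a transformer to the matrix sizes, is an assumption made purely for illustration.

```python
import time
import numpy as np

HEAD_DIM = 64                                 # illustrative sizes, not the paper's setup
W_q = np.random.randn(HEAD_DIM, HEAD_DIM)
W_k = np.random.randn(HEAD_DIM, HEAD_DIM)
W_v = np.random.randn(HEAD_DIM, HEAD_DIM)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def decode_recompute(window):
    """Re-project every token in the window before attending: cost grows with the window."""
    K, V = window @ W_k, window @ W_v
    q = window[-1] @ W_q
    return softmax(K @ q / np.sqrt(HEAD_DIM)) @ V


def decode_streaming(K, V, new_token):
    """Reuse cached keys/values: only the newest token is projected."""
    K = np.vstack([K, new_token @ W_k])
    V = np.vstack([V, new_token @ W_v])
    q = new_token @ W_q
    return softmax(K @ q / np.sqrt(HEAD_DIM)) @ V


for size in (256, 1024, 4096):
    window = np.random.randn(size, HEAD_DIM)
    cached_K, cached_V = window @ W_k, window @ W_v   # built once, reused every step

    start = time.perf_counter()
    decode_recompute(window)
    recompute_ms = (time.perf_counter() - start) * 1e3

    start = time.perf_counter()
    decode_streaming(cached_K, cached_V, window[-1])
    streaming_ms = (time.perf_counter() - start) * 1e3

    print(f"window={size}: recompute {recompute_ms:.2f} ms, cached {streaming_ms:.2f} ms")
```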
According to AI experts not involved in the work, StreamingLLM’s approach keeps memory use and performance stable even as conversations grow, enabling its use in long-running dialogues and broadening the reach of AI-driven generation applications.
Looking ahead, the researchers are exploring ways to retrieve tokens that have been evicted, or to let the model memorize earlier parts of a conversation, since the current method cannot recall words that are no longer stored in the cache.
The StreamingLLM method has already been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. Overall, StreamingLLM has the potential to transform how AI chatbots operate and how they are applied across a broad array of tasks. The research is funded, in part, by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.