A team of researchers from MIT, Meta AI, Carnegie Mellon University, and NVIDIA has found a solution to the performance degradation AI chatbots suffer during extended human-AI conversations. They traced the problem to the model’s conversation memory, known as the key-value cache: when the cache exceeds its capacity, the earliest data is bumped out, which can cause the chatbot to fail.
The team’s solution, called StreamingLLM, keeps the first few data points in memory, enabling chatbots to function continuously regardless of conversation length. The method allows a chatbot to remain efficient even in dialogues spanning more than 4 million words, and it runs more than 22 times faster than an alternative approach that avoids crashing by constantly recomputing parts of the earlier conversation.
Using StreamingLLM, an AI chatbot can carry on long dialogues throughout a workday without needing constant reboots. That efficiency is useful for tasks like copywriting, editing, or generating code, and it opens the door to new applications for these chatbots.
Large language models store recent tokens in a memory cache so they can be reused when generating new text. But when the cache grows very large, the attention map it generates also grows, slowing down computation. The problem is compounded when encoding the content requires more tokens than the cache can hold, because the earliest tokens get bumped out. The researchers resolved this by ensuring that the first token, referred to as an “attention sink”, remains in the cache even after the cache fills up.
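That policy can be sketched in a few lines of Python. The toy cache below is a simplified illustration, not the researchers’ released code; the class name, capacity, and string "tokens" are made up for the example. It always keeps the first token and only evicts the oldest entries from the sliding window:

```python
# Minimal sketch of the cache policy described above: when the key-value cache
# is full, evict the oldest entry *except* the first "attention sink" token,
# which always stays in memory.

from collections import deque

class SinkKVCache:
    """Toy KV cache that never evicts the first (attention-sink) token(s)."""

    def __init__(self, capacity, num_sinks=1):
        self.capacity = capacity
        self.num_sinks = num_sinks
        self.sinks = []        # first token(s), kept for the whole stream
        self.window = deque()  # sliding window of recent tokens

    def add(self, token_kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_kv)
            return
        self.window.append(token_kv)
        # Evict the oldest *non-sink* entry once capacity is exceeded.
        while len(self.sinks) + len(self.window) > self.capacity:
            self.window.popleft()

    def contents(self):
        return self.sinks + list(self.window)


cache = SinkKVCache(capacity=4, num_sinks=1)
for tok in ["<s>", "The", "cat", "sat", "on", "the", "mat"]:
    cache.add(tok)
print(cache.contents())  # ['<s>', 'on', 'the', 'mat'] -- the sink survives eviction
```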
In StreamingLLM, the team discovered that placing four attention sink tokens at the start of the sliding cache gives optimal performance. They also established that each token’s positional encoding must be handled consistently as the cache changes: positions are assigned according to a token’s slot within the cache rather than its place in the original text. Together, these choices enable StreamingLLM to maintain uninterrupted conversations while outperforming the popular recomputation method. With a 256-token cache, for instance, StreamingLLM takes 31 milliseconds to decode a new token, compared to 63 milliseconds for recomputation.
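The positional-encoding detail can be illustrated the same way. The sketch below is again a toy example, with a hypothetical window of eight recent tokens rather than the cache sizes used in practice; positions are assigned by where a token sits in the cache, not by where it appeared in the conversation:

```python
# Illustration of cache-relative position handling: the model always sees a
# contiguous range of positions 0..len(cache)-1, even though the tokens held
# in the cache are not contiguous in the original text.

NUM_SINKS = 4      # attention-sink tokens kept at the start of the cache
WINDOW = 8         # recent tokens kept in the sliding window (toy value)
CACHE_SIZE = NUM_SINKS + WINDOW

def cached_token_ids(step):
    """Original text indices of the tokens held in the cache after `step` tokens."""
    if step <= CACHE_SIZE:
        return list(range(step))
    # Keep the first NUM_SINKS tokens plus the most recent WINDOW tokens.
    return list(range(NUM_SINKS)) + list(range(step - WINDOW, step))

def cache_positions(step):
    """Positions fed to the model: 0..len(cache)-1, regardless of text indices."""
    return list(range(len(cached_token_ids(step))))

# After 20 generated tokens, the cache holds text indices 0-3 and 12-19,
# but the positional encodings it uses are simply 0 through 11.
print(cached_token_ids(20))  # [0, 1, 2, 3, 12, 13, ..., 19]
print(cache_positions(20))   # [0, 1, 2, ..., 11]
```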
Though the model cannot recall words that are no longer stored in the cache, future research aims to address this limitation, whether by retrieving evicted tokens or by enabling the model to memorize previous conversations. StreamingLLM has been integrated into NVIDIA’s large language model optimization library, TensorRT-LLM. The research is funded, in part, by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.