Researchers from MIT and other institutions have developed a method to prevent chatbot performance from deteriorating during prolonged human-AI interactions. The method, called StreamingLLM, rests on a small modification to the key-value cache (KV Cache) at the core of the large language models that power many AI-driven platforms. The KV Cache acts as a kind of “conversation memory”: when it fills to capacity, the oldest data is pushed out, which can cause the model to fail. The researchers showed that by retaining the first few data points in this memory, a chatbot’s performance can be maintained no matter how long the conversation runs.
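In rough terms, the caching policy can be sketched in a few lines of Python. This is only an illustrative sketch, not the researchers’ implementation; the class name, the cache capacity, and the choice of four pinned “sink” tokens are assumptions made for the example.

```python
from collections import deque

class SinkKVCache:
    """Toy sketch of a StreamingLLM-style cache policy (illustrative only)."""

    def __init__(self, capacity=1024, num_sink_tokens=4):
        # Four sink tokens and a 1,024-entry capacity are illustrative choices.
        self.num_sink_tokens = num_sink_tokens
        self.sinks = []  # the first few tokens: never evicted
        self.window = deque(maxlen=capacity - num_sink_tokens)  # recent tokens

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sink_tokens:
            self.sinks.append(kv_entry)   # pin the earliest tokens permanently
        else:
            self.window.append(kv_entry)  # a full deque silently drops its oldest entry

    def contents(self):
        # What attention sees: the pinned sinks plus the most recent window.
        return self.sinks + list(self.window)
```

A plain sliding-window cache would be the same code without `self.sinks`; dropping those first entries once the memory fills is exactly what leads to the failures described above.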
It was previously assumed that retaining these earliest data points was unnecessary, since their relevance to the most recent part of the conversation is usually negligible. Yet the researchers found that the Softmax operation, which the attention mechanism in these models uses to assign a weight to every token (or word) in the cache, tends to give a substantial share of that weight to the first token regardless of its relevance. They dubbed this the “attention sink.” This insight allowed the developers to build StreamingLLM, a model that remains stable and efficient even during very lengthy conversations.
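The pull of the attention sink can be illustrated with a small NumPy example. The attention scores below are invented for the purpose of illustration; the point is that the Softmax operation must spread a total weight of 1.0 across whatever tokens remain in the cache, so a model that has learned to park spare attention on the first token is thrown off when that token is evicted.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Hypothetical attention scores from one query to six cached tokens.
# The model has learned to give the first position a high score even
# though that token carries little meaning for the current query.
scores = np.array([2.0, 0.1, 0.3, 0.2, 0.4, 0.3])

print(softmax(scores))     # the first token absorbs most of the weight
print(softmax(scores[1:])) # evict it, and that weight is forced onto
                           # tokens the model was never trained to favor
```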
StreamingLLM has been shown to perform up to 22 times faster than a widely used alternative that avoids failures by continuously recomputing part of the past conversation. It maintains its efficiency even when a conversation stretches to more than 4 million words. For instance, with a cache of 256 words, StreamingLLM took 31 milliseconds to decode a new word, compared with 63 milliseconds for the recomputation method. Even with the cache expanded to 4,096 words, it needed only 65 milliseconds to decode a new word.
The new method will allow chatbots to carry on long conversations throughout the day without needing to be reset, boosting the efficiency of AI assistants. Its uses could extend to tasks such as copywriting, editing, or generating code, and it has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. However, the current model cannot remember words that are no longer stored in the cache, a limitation the team plans to address in future research.