Researchers from the Massachusetts Institute of Technology (MIT) and partner organizations have developed a solution to a key issue limiting the effectiveness of AI chatbots. The large language models that drive chatbots such as ChatGPT often slow down or break down during extended rounds of dialogue with humans. The study traced the problem to the eviction of the earliest tokens held in a key-value cache (a form of conversational memory) once the cache exceeds its storage capacity.
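The failure mode described above can be illustrated with a toy sketch, not the actual cache used by these models: a fixed-capacity store that evicts its oldest entries first, so the conversation's opening tokens are silently lost once capacity is reached.

```python
from collections import deque

class ToyKVCache:
    """Toy fixed-capacity cache: when full, the oldest entries are
    evicted first. Each string stands in for a token's key-value pair."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()

    def add(self, token: str) -> None:
        if len(self.entries) >= self.capacity:
            self.entries.popleft()  # the earliest token is silently dropped
        self.entries.append(token)

cache = ToyKVCache(capacity=4)
for tok in ["<bos>", "The", "cat", "sat", "on", "the", "mat"]:
    cache.add(tok)

print(list(cache.entries))  # → ['sat', 'on', 'the', 'mat']
```

Note that the very first token, `<bos>`, is among the evicted entries, which is exactly the loss the researchers identified as destabilizing the model.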
These researchers have now developed a method called StreamingLLM, which ensures these early tokens remain in the cache, allowing the chatbot to function optimally no matter how long the conversation continues. StreamingLLM remained efficient during conversations extending to over 4 million words, running more than 22 times faster than a competing method. This enhancement could allow AI to be used effectively for long-running tasks like copywriting, editing, or code generation without needing continual restarts.
An unexpected phenomenon discovered during this research is the crucial role of the first data point, or "token," in maintaining the chatbot's performance. Dubbed an "attention sink" by the researchers, this token, by always remaining present and visible to other tokens, helps stabilize the attention dynamics of the model. The researchers found that keeping four attention sink tokens leads to optimal performance.
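The idea can be sketched as a cache that pins the first few tokens as attention sinks and evicts only from the middle, keeping a sliding window of recent tokens. This is a minimal illustration of the eviction policy, not the researchers' actual implementation; the class and parameter names are invented here.

```python
class SinkWindowCache:
    """Sketch of the StreamingLLM eviction policy: the first `num_sinks`
    tokens are never evicted; beyond that, only the most recent `window`
    tokens are kept, and the oldest non-sink token is dropped."""

    def __init__(self, num_sinks: int = 4, window: int = 8):
        self.num_sinks = num_sinks
        self.window = window
        self.tokens = []

    def add(self, token) -> None:
        self.tokens.append(token)
        if len(self.tokens) > self.num_sinks + self.window:
            # Evict the oldest token *after* the sinks; sinks stay put.
            del self.tokens[self.num_sinks]

cache = SinkWindowCache(num_sinks=4, window=4)
for i in range(20):
    cache.add(i)

print(cache.tokens)  # → [0, 1, 2, 3, 16, 17, 18, 19]
```

The first four tokens survive the entire stream, which is what keeps the model's attention distribution stable however long the conversation runs.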
Furthermore, StreamingLLM’s effectiveness was maintained by ensuring that the positionally encoded identity assigned to each token remained consistent even when new ones were added or older ones removed from the cache. The StreamingLLM system markedly outperformed a popular method that leverages recomputation – for example, when the cache holds 256 tokens, a new token takes 31 milliseconds to be decoded by StreamingLLM versus 63 milliseconds using the recomputation method. At a larger cache size of 4,096 tokens, StreamingLLM needed just 65 milliseconds compared to the recomputation method requiring 1,411 milliseconds.
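The positional-encoding point above can be made concrete with a small sketch, under the assumption (stated in the text) that positions are assigned by a token's slot within the cache rather than by its original index in the conversation; the function name here is illustrative, not the library's API.

```python
def positions_for_cache(tokens_in_cache):
    """Assign position ids by slot within the cache (0, 1, 2, ...),
    not by each token's original index in the conversation. Evicting
    older tokens therefore does not change how remaining tokens are
    positionally encoded relative to one another."""
    return {tok: slot for slot, tok in enumerate(tokens_in_cache)}

# A token that originally appeared at, say, text position 1000 is
# encoded by its cache slot, here slot 6 of a 7-entry cache:
cache = ["<s>", "a", "b", "c", "tok_998", "tok_999", "tok_1000"]
print(positions_for_cache(cache)["tok_1000"])  # → 6
```

This cache-relative scheme is what lets StreamingLLM skip the expensive recomputation step that the baseline method performs whenever the cache contents change.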
While StreamingLLM shows promise, the AI model cannot remember words that are no longer in the cache. To address this, the team aims to study techniques for retrieving evicted tokens or enabling the model to memorize past conversations. StreamingLLM has already been integrated into NVIDIA’s large language model optimization library, TensorRT-LLM. The study was partly funded by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.