A team of researchers from MIT and other institutions has identified a key reason that chatbot performance degrades: when engaged in extensive dialogue, the large language models behind bots like ChatGPT can begin to fail. The team also devised a solution that enables nonstop conversation without deterioration or lag. Their method, named StreamingLLM, modifies the key-value cache (essentially the conversation memory) at the heart of most large language models. Normally, when the cache must store more data than it can hold, the earliest entries are evicted and the model fails. By keeping those initial data points in memory, the method allows a chatbot to keep conversing no matter how long the dialogue runs.
StreamingLLM keeps the model working efficiently even during dialogues that stretch past four million words. Compared with methods that avoid failure by constantly recomputing parts of past conversations, StreamingLLM is more than 22 times faster. This allows a chatbot to sustain long conversations throughout the day without frequent restarts, making for more efficient AI assistants for tasks like copywriting, editing, or code generation.
Large language models transform data, such as the words in a user query, into representations called tokens, and they use an attention mechanism to generate new text from those tokens. Recent tokens are stored in a memory area called a KV cache so they can be reused later. The attention mechanism builds an 'attention map', a grid spanning all the tokens in the cache that records how strongly each token relates to every other token. As the cache grows, the attention map grows with it, slowing computation. In addition, if encoding a piece of content requires more tokens than the cache can hold (an academic paper may run to about 10,000 tokens, while one popular model can store only 4,096), performance also degrades.
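To make the mechanics concrete, here is a minimal sketch in Python (using NumPy, with toy dimensions chosen purely for illustration, not code from any particular model or from the study) of how a decoder reuses cached keys and values: each new token attends over everything in the KV cache, so the per-step attention computation grows with the cache length.

```python
import numpy as np

def attend(query, cached_keys, cached_values):
    """Scaled dot-product attention of one new token over the KV cache."""
    d = query.shape[-1]
    scores = cached_keys @ query / np.sqrt(d)   # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # this token's row of the attention map
    return weights @ cached_values              # weighted mix of cached values

d_model = 64
rng = np.random.default_rng(0)
keys, values = [], []                           # the KV cache

for step in range(1, 6):
    q = rng.normal(size=d_model)                # query for the newest token
    keys.append(rng.normal(size=d_model))       # cache this token's key ...
    values.append(rng.normal(size=d_model))     # ... and its value for later reuse
    out = attend(q, np.stack(keys), np.stack(values))
    print(f"step {step}: attention row covers {len(keys)} cached tokens")
```

Stacked over every generated token, these rows form the attention map described above, which is why a longer cache means slower generation steps.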
To work around these problems, researchers often use a 'sliding cache' that evicts the oldest tokens to make room for new ones. However, the model's performance tends to plummet as soon as that very first token is removed, rapidly degrading the quality of newly generated words. According to the new study, keeping the first token in the sliding cache preserves the model's performance even after the cache size is exceeded.
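The following toy sketch (hypothetical helpers, not the study's code) contrasts the two eviction policies: a plain sliding cache drops the oldest entry once it is full, while the fix described above never removes the first token.

```python
def evict_sliding(cache, max_size):
    """Plain sliding cache: once full, drop the oldest token to make room."""
    while len(cache) > max_size:
        cache.pop(0)              # evicting index 0 is what hurts output quality

def evict_keep_first(cache, max_size):
    """Variant described in the study: always retain the first token and
    evict the oldest of the remaining tokens instead."""
    while len(cache) > max_size:
        cache.pop(1)              # index 0 (the first token) is never removed

cache = list(range(8))            # stand-in token ids 0..7
evict_keep_first(cache, max_size=5)
print(cache)                      # [0, 4, 5, 6, 7] -- the first token survives
```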
The researchers found that keeping four 'attention sink' tokens at the start of the sliding cache yields peak performance. They also found that each token's positional encoding must be maintained as tokens are added and removed. Combining these two ideas, they created StreamingLLM, which can sustain a continuous conversation while outperforming methods that rely on recomputation.
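As a rough illustration only, and not the authors' implementation, the sketch below combines the two ideas: a cache that pins four attention-sink tokens and keeps a sliding window of recent tokens. The class name and the cache-relative handling of positions are assumptions made for the example.

```python
class StreamingKVCache:
    """Toy cache illustrating the recipe described above: pin a few
    'attention sink' tokens from the start of the stream and keep a
    sliding window of the most recent tokens."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.window = window
        self.tokens = []                # stand-ins for cached (key, value) pairs

    def append(self, token):
        self.tokens.append(token)
        if len(self.tokens) > self.num_sinks + self.window:
            # Evict the oldest non-sink token; the sink tokens are never removed.
            del self.tokens[self.num_sinks]

    def positions(self):
        # Assumption for illustration: positions are assigned by slot in the
        # cache rather than by position in the original text.
        return list(range(len(self.tokens)))

cache = StreamingKVCache(num_sinks=4, window=8)
for t in range(20):                     # stream 20 token ids through the cache
    cache.append(t)
print(cache.tokens)       # [0, 1, 2, 3, 12, ..., 19]: 4 sinks + 8 most recent tokens
print(cache.positions())  # [0, 1, ..., 11]
```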
StreamingLLM has been incorporated into NVIDIA's large language model optimization library, TensorRT-LLM, and could be applied across a wide variety of AI tasks. The team also plans to address current limitations, such as the model's inability to remember words that are no longer in the cache, by studying methods for retrieving evicted tokens or enabling the model to memorize previous conversations. The project was funded in part by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.