When a dialogue runs long, AI-powered chatbots often begin to falter and their performance degrades sharply. A team of researchers from MIT and elsewhere has pinpointed a cause of this problem and devised a simple solution that keeps a chatbot from crashing or slowing down. Their method, StreamingLLM, enables a model to sustain a conversation no matter how long it runs.
The fix involves a tweak to the key-value (KV) cache, which acts as a conversational memory at the core of many large language models. In some approaches, when the cache needs to hold more data than it has room for, the earliest pieces of data are evicted, and that can cause the model to fail.
By ensuring that these first few data points stay in memory, the method allows a chatbot to keep talking no matter how long the conversation lasts. StreamingLLM keeps a model efficient even when a dialogue stretches past 4 million words, and it performed more than 22 times faster than an alternative approach that avoids crashing by constantly recomputing parts of past conversations.
This advance could allow a chatbot to carry on conversations throughout a workday without needing frequent restarts, enabling efficient AI assistants for tasks such as copywriting, editing, and generating code.
Large language models convert data, such as the words in a user query, into tokens. Most models employ an “attention mechanism” that uses these tokens to generate new text. Because a model typically writes new text based on text it has just seen, it stores recent tokens in memory, known as a KV cache, so it can reuse them later. The attention mechanism builds a grid that includes every token in the cache, an “attention map” that records how strongly each token relates to every other token.
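To make the idea concrete, here is a minimal sketch in NumPy of one step of that lookup; it is not the researchers’ code, and the array names are purely illustrative. A new query token is scored against every key in the cache, the scores become one row of the attention map, and the softmax-weighted values produce the representation used to predict the next word.

```python
# Minimal sketch of one decoding step over a KV cache (illustrative names only).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # embedding dimension (toy value)
cached_keys = rng.normal(size=(5, d))    # keys for 5 previously seen tokens
cached_values = rng.normal(size=(5, d))  # values for those same tokens
query = rng.normal(size=(d,))            # the token being generated now

scores = cached_keys @ query / np.sqrt(d)        # one score per cached token
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: one row of the attention map
output = weights @ cached_values                 # weighted mix of the cached values

print(weights.sum())   # 1.0 -- each row of the map is normalized
print(output.shape)    # (8,) -- representation used to predict the next token
```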
However, when the cache grows very large, the attention map grows even larger, which dramatically slows computation. And if encoding content requires more tokens than the cache can hold, the model’s performance drops. The researchers found a surprisingly simple workaround: keep the first token in the sliding cache, and the model maintains its performance even after the cache size is exceeded.
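The difference between the two policies can be sketched in a few lines of Python. This is a deliberate simplification that treats the cache as a list of token positions rather than the key-value tensors a real model stores.

```python
# Two simplified eviction policies for a fixed-size cache of token positions.
def plain_sliding_window(tokens, cache_size):
    """Keep only the most recent tokens -- the earliest ones get evicted."""
    return tokens[-cache_size:]

def window_with_first_token(tokens, cache_size):
    """Pin the very first token and fill the rest with the most recent ones."""
    if len(tokens) <= cache_size:
        return tokens
    return tokens[:1] + tokens[-(cache_size - 1):]

tokens = list(range(10))                     # tokens 0..9 seen so far
print(plain_sliding_window(tokens, 4))       # [6, 7, 8, 9] -- token 0 is gone
print(window_with_first_token(tokens, 4))    # [0, 7, 8, 9] -- token 0 is kept
```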
Some models use a Softmax operation in their attention mechanism, which assigns each token a score representing how strongly it relates to every other token. The Softmax operation requires all attention scores to sum to 1. Because most tokens are only weakly related, their individual scores end up very low, and the model dumps whatever attention score remains onto the first token, which the researchers call an “attention sink.”
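The sum-to-1 constraint is easy to illustrate with a toy example (not taken from the paper): however weakly a new token relates to the cached tokens, the softmax weights still have to add up to 1, so that probability mass has to land somewhere. The tendency of trained models to park much of it on the first token is an emergent behavior the researchers observed, not something a few random numbers can reproduce.

```python
# Toy illustration of the Softmax constraint only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

raw_scores = np.array([-1.2, -2.0, -1.5, -1.8, -2.2])  # all weakly related
weights = softmax(raw_scores)

print(weights)        # roughly [0.32 0.14 0.24 0.18 0.12]
print(weights.sum())  # 1.0 -- the mass must be distributed somewhere
```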
In developing StreamingLLM, the researchers found that placing four attention sink tokens at the start of the sliding cache leads to optimal performance. They also discovered that each token’s positional encoding must stay the same even as new tokens are added and older ones are evicted.
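Put together, the recipe the article describes amounts to a simple cache-management rule, sketched below in illustrative Python. The constant and function names are hypothetical, and a real implementation operates on key-value tensors rather than token indices: pin four sink tokens at the front of the cache and let the rest slide over the most recent tokens.

```python
NUM_SINKS = 4   # the article reports four sink tokens work best

def streaming_window(num_tokens_seen, cache_size):
    """Positions of the tokens kept in the cache after num_tokens_seen tokens."""
    all_positions = list(range(num_tokens_seen))
    if num_tokens_seen <= cache_size:
        return all_positions
    # Pin the first four tokens; fill the remaining slots with the newest ones.
    return all_positions[:NUM_SINKS] + all_positions[-(cache_size - NUM_SINKS):]

# With room for 8 tokens after seeing 12, tokens 4-7 have been evicted, while
# the four sinks (0-3) and the most recent tokens (8-11) remain. Per the
# article, each surviving token keeps the positional encoding it already had.
print(streaming_window(12, 8))   # [0, 1, 2, 3, 8, 9, 10, 11]
```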
With these two ingredients, StreamingLLM can sustain an uninterrupted conversation while outperforming a popular method that relies on recomputation. For example, when the cache holds 256 tokens, the recomputation method takes 63 milliseconds to decode a new token, whereas StreamingLLM takes 31 milliseconds. If the cache grows to 4,096 tokens, recomputation needs 1,411 milliseconds per new token, while StreamingLLM needs just 65 milliseconds.
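A quick back-of-the-envelope calculation on those reported latencies shows how the per-token advantage grows with cache size; this is simple arithmetic on the figures above, not an additional benchmark.

```python
# Per-token speedups implied by the reported latencies (milliseconds).
latencies_ms = {             # cache size -> (recomputation, StreamingLLM)
    256:  (63, 31),
    4096: (1411, 65),
}
for cache_size, (recompute, streaming) in latencies_ms.items():
    print(f"{cache_size} tokens: {recompute / streaming:.1f}x faster per decoded token")
# 256 tokens: 2.0x faster per decoded token
# 4096 tokens: 21.7x faster per decoded token
```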
StreamingLLM has since been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. The work was funded by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation. Although StreamingLLM enables a model to maintain a continuous dialogue, it cannot remember words that are no longer stored in its cache, a limitation the researchers aim to address in the future by investigating ways to retrieve evicted tokens or enable the model to memorize previous conversations.