Researchers from MIT and other institutions have developed a method that keeps AI chatbots from breaking down during long conversations. The team's solution centers on the key-value cache (KV cache), which acts as the chatbot's conversation memory inside a language model. Many existing models falter once this cache overflows, because incoming data pushes out the earliest data. By keeping those first few data points in memory, the researchers' method lets an AI chatbot hold a conversation of any length.
Named StreamingLLM, the method allows a chatbot model to remain efficient even when a conversation stretches past 4 million words. Compared with an approach that prevents failure by constantly recomputing past conversation segments, StreamingLLM performed more than 22 times faster. This efficiency could enable AI chatbots to sustain long conversations without requiring constant reboots, making them more effective assistants for tasks such as copywriting, editing, or coding.
Large language models convert data, such as the words in a user query, into tokens that are stored in the KV cache. These cached tokens help generate new text with the assistance of an attention map, which records how each token relates to the others. However, as the cache becomes saturated, the attention map expands and computation slows. And if encoding the content requires more tokens than the cache can store, the model's performance drops.
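To make this concrete, here is a minimal sketch, not code from the paper, of single-head attention over a growing KV cache; the function name `decode_step` and the toy dimensions are illustrative assumptions. It shows why each new decoding step gets more expensive as the cache fills.

```python
# Illustrative sketch only: single-head attention over a growing KV cache.
# Every decoding step appends one key/value pair and then attends over the
# whole cache, so the softmax runs over more entries as the conversation grows.
import numpy as np

d = 64                              # head dimension (toy value)
cache_k = np.zeros((0, d))          # cached keys, oldest first
cache_v = np.zeros((0, d))          # cached values, oldest first

def decode_step(query, new_k, new_v):
    global cache_k, cache_v
    cache_k = np.vstack([cache_k, new_k])     # new token enters the KV cache
    cache_v = np.vstack([cache_v, new_v])
    scores = cache_k @ query / np.sqrt(d)     # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: weights always sum to 1
    return weights @ cache_v                  # attention output for this step

rng = np.random.default_rng(0)
for step in range(8):                         # per-step cost grows with the cache
    out = decode_step(rng.standard_normal(d),
                      rng.standard_normal((1, d)),
                      rng.standard_normal((1, d)))
```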
To address this, current methods use a sliding cache that bumps out the oldest tokens to make room for new ones. But performance often plummets as soon as the first token is evicted. The researchers discovered that keeping the first token in the sliding cache preserves performance even when the cache capacity is exceeded.
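A baseline sliding cache is easy to sketch. The helper below is hypothetical and not from the paper; it simply drops the oldest entries once capacity is exceeded, which is exactly the step that evicts the crucial first token.

```python
# Illustrative baseline: a plain sliding KV cache that evicts the oldest tokens.
# Once the first token is pushed out, model quality tends to degrade sharply.
def sliding_window_evict(cache, capacity):
    """cache: list of cached tokens, oldest first."""
    if len(cache) <= capacity:
        return cache
    return cache[len(cache) - capacity:]      # keep only the most recent tokens

print(sliding_window_evict(list(range(10)), capacity=6))   # [4, 5, 6, 7, 8, 9]
```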
One reason behind this involves the softmax operation in the attention mechanism, which assigns each token a score based on how strongly it relates to the others. Because most tokens are only weakly related, they receive low attention scores, and the surplus attention gets dumped into the first token, which the researchers term the 'attention sink'. The team had to ensure that this attention sink always remained in the cache to maintain model performance, and they found that placing four attention sinks at the start of the sliding cache led to the best results.
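The cache policy described above can be sketched as follows. This is an illustrative outline, not the released StreamingLLM code; the list-based cache and the function name are assumptions. The first four tokens are kept permanently as attention sinks, and the sliding window applies only to the rest.

```python
# Illustrative sketch of the policy described above: keep the first few
# "attention sink" tokens permanently, and apply the sliding window only to
# the rest of the cache.
def sink_aware_evict(cache, capacity, num_sinks=4):
    """cache: list of cached tokens, oldest first; capacity includes the sinks."""
    if len(cache) <= capacity:
        return cache
    sinks = cache[:num_sinks]                     # attention sinks, never evicted
    recent = cache[num_sinks:]
    window = capacity - num_sinks                 # room left for recent tokens
    return sinks + recent[len(recent) - window:]

print(sink_aware_evict(list(range(12)), capacity=8))   # [0, 1, 2, 3, 8, 9, 10, 11]
```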
The researchers also determined that positional encodings must be assigned consistently as new tokens are added and old ones are evicted: each token is encoded according to its position within the cache rather than its position in the original text. Together, these ingredients allow StreamingLLM to carry on an ongoing conversation while performing better than the popular recomputation method.
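A minimal sketch of that positional idea, under the assumption of a cache stored oldest-to-newest (real models apply rotary or relative encodings inside attention):

```python
# Illustrative sketch of cache-relative positions: tokens are encoded by their
# slot in the current cache, not by their index in the original conversation,
# so the position range stays fixed no matter how many tokens have been evicted.
def cache_relative_positions(cache):
    """cache: list of original token indices currently held, oldest first."""
    return list(range(len(cache)))

cache = [0, 1, 2, 3, 1021, 1022, 1023, 1024]      # 4 sinks + recent window
print(cache_relative_positions(cache))            # [0, 1, 2, 3, 4, 5, 6, 7]
```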
However, StreamingLLM has limitations: the model cannot recall words that are no longer stored in the cache. The researchers plan to address this by exploring methods to retrieve evicted tokens or to let the model remember earlier parts of a conversation. StreamingLLM has already been incorporated into NVIDIA's language model optimization library, TensorRT-LLM. The research was funded in part by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.