Researchers from MIT and other institutions have found a solution to a problem that causes chatbots powered by machine-learning models to break down during long, continuous dialogues. They found that significant slowdowns or crashes occur when the key-value cache, essentially the conversation memory, overflows: the earliest data is evicted and the model begins to fail. The researchers developed a method, called StreamingLLM, that keeps these initial data points in the cache, allowing the chatbot to carry on a conversation without interruption. StreamingLLM remained efficient even when conversations stretched beyond 4 million words, running more than 22 times faster than a method that avoids crashes by repeatedly recomputing part of the past conversation. This could enable an AI assistant to hold lengthy conversations throughout the day without needing to be restarted, making it more effective at tasks such as copywriting, editing, or generating code.

Large language models encode data, such as the words in a user's query, into tokens. An attention mechanism uses these tokens to generate new text. However, as the cache grows very large, the attention map swells with it, slowing computation and, in some cases, causing the model to fail once the cache capacity is exceeded. To tackle this, researchers have used a "sliding cache" that bumps out the oldest tokens to make room for new ones. But this approach hurts quality: as soon as that first token is evicted, the model's performance tends to degrade quickly.
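The sliding-cache idea can be illustrated with a short sketch. The snippet below is a minimal, hypothetical Python example (the class and names are illustrative, not the researchers' code): once the cache reaches capacity, the oldest entry, including the very first token, is pushed out to make room for the next one.

```python
from collections import deque


class SlidingKVCache:
    """Minimal sketch of a sliding key-value cache: a fixed-size window
    that evicts the oldest token whenever a new one arrives."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # each entry: (token_id, key_vector, value_vector)

    def append(self, token_id, key, value):
        if len(self.entries) == self.capacity:
            self.entries.popleft()  # the earliest token is the first to go
        self.entries.append((token_id, key, value))
```

Evicting that earliest token is exactly the step that turns out to be harmful, as described next.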

The researchers found, rather unintuitively, that retaining the first token in the cache preserves model performance even after the cache size is exceeded. Further investigation revealed why: some models assign each token a score representing how strongly it relates to every other token, and because these scores must sum to one, any leftover score gets dumped into the first token, which the researchers named the "attention sink." In developing StreamingLLM, they found that keeping four attention sink tokens at the start of the cache, and holding each token's positional encoding constant, gave the best performance.
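A hedged sketch of that eviction policy, again with illustrative names and with the positional-encoding details omitted for brevity: the first few tokens are pinned as attention sinks, and eviction always removes the oldest token after them.

```python
class StreamingKVCache:
    """Sketch of an attention-sink cache: the first `num_sinks` tokens are
    never evicted; when the cache is full, the oldest non-sink entry goes."""

    def __init__(self, capacity, num_sinks=4):
        self.capacity = capacity
        self.num_sinks = num_sinks  # attention sink tokens kept for the whole conversation
        self.entries = []           # (token_id, key_vector, value_vector) tuples

    def append(self, token_id, key, value):
        if len(self.entries) == self.capacity:
            # Drop the oldest entry *after* the attention sinks, so the
            # sink tokens remain in the cache no matter how long the
            # conversation runs.
            del self.entries[self.num_sinks]
        self.entries.append((token_id, key, value))
```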

For instance, the recomputation method took 63 milliseconds to decode a new token when the cache had 256 tokens, while StreamingLLM took 31 milliseconds. If the cache size grew to 4,096 tokens, recomputation required 1,411 milliseconds for a new token, whereas StreamingLLM required just 65 milliseconds. StreamingLLM has since been integrated into NVIDIA’s large language model optimization library, TensorRT-LLM.
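As a quick sanity check on those figures, the speedups implied by the quoted timings work out as follows (the numbers below are simply the ones reported above):

```python
# Speedup of StreamingLLM over recomputation, using the timings quoted above.
timings_ms = {
    256:  {"recompute": 63,   "streamingllm": 31},
    4096: {"recompute": 1411, "streamingllm": 65},
}

for cache_size, t in timings_ms.items():
    speedup = t["recompute"] / t["streamingllm"]
    print(f"cache = {cache_size} tokens: {speedup:.1f}x faster")
# cache = 256 tokens: 2.0x faster
# cache = 4096 tokens: 21.7x faster  (roughly the 22x figure cited earlier)
```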

The researchers also found that training models with attention sinks in place greatly improved performance. The main limitation of the approach, however, is that the model cannot remember words that are no longer stored in the cache. To address this, the researchers plan to explore ways of retrieving evicted tokens or enabling the model to memorize earlier parts of the conversation.
