Large language models are notorious for crashing or slowing down during long human-AI dialogues, a major barrier to using chatbots effectively in many applications. Now, a team of researchers from MIT and other institutions has proposed a solution: by modifying the key-value cache, which acts as the model's ‘conversation memory’, they enabled chatbots to sustain extended conversations without failing.
The researchers traced the problem to the key-value cache. In conventional large language models, when the cache has to hold more data than it has room for, the oldest entries are evicted, which frequently causes the model to fail. To address this, the researchers developed a method called StreamingLLM that keeps the first few tokens in the cache, allowing the model to keep working even after its capacity is exceeded.
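The sketch below illustrates that eviction idea in Python: the earliest tokens are pinned while everything else behaves like a sliding window. The class and method names are hypothetical, and a real KV cache stores per-layer key and value tensors rather than token IDs; this is a minimal sketch of the policy described above, not the researchers' implementation.

```python
from collections import deque

class SinkKVCache:
    """Toy fixed-capacity cache that never evicts its first few entries."""

    def __init__(self, capacity, num_sinks=4):
        self.capacity = capacity
        self.num_sinks = num_sinks
        self.sinks = []        # the first tokens seen; kept permanently
        self.window = deque()  # the most recent tokens; oldest evicted first

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.window.append(kv_entry)
        # A plain FIFO cache would drop the earliest tokens here; instead,
        # only entries outside the protected sink region are evicted.
        while len(self.sinks) + len(self.window) > self.capacity:
            self.window.popleft()

    def contents(self):
        return self.sinks + list(self.window)

cache = SinkKVCache(capacity=8, num_sinks=4)
for token_id in range(20):
    cache.append(token_id)
print(cache.contents())  # [0, 1, 2, 3, 16, 17, 18, 19]
```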
With StreamingLLM, a language model can keep operating efficiently even when a conversation stretches to more than 4 million words. It also runs up to 22 times faster than alternative approaches that avoid crashing by repeatedly recomputing parts of the past conversation. This capability could let AI assistants handle tasks such as copyediting and code generation throughout an entire workday without needing to be restarted.
In large language models, input data, such as the words in a user query, are converted into representations called ‘tokens’. The model stores these tokens in memory, known as the KV cache, and draws on them later when generating new text. But when the cache grows too large, the attention map, which records how strongly each token relates to every other token, becomes so big that computation slows down.
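As a rough illustration of why a growing cache slows generation, the sketch below computes single-head attention for one new token against every cached key and value. The names are invented and real models use many heads, batching, and fused GPU kernels; the point is that the row of attention scores has one entry per cached token, so the work per generated token grows with the cache length.

```python
import numpy as np

def attend_one_token(query, cached_keys, cached_values):
    """Single-head attention for one new token over the whole KV cache."""
    d = query.shape[-1]
    scores = cached_keys @ query / np.sqrt(d)   # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the cache
    return weights @ cached_values              # weighted mix of cached values

d_model, cache_len = 64, 4096
query = np.random.randn(d_model)
keys = np.random.randn(cache_len, d_model)
values = np.random.randn(cache_len, d_model)
new_representation = attend_one_token(query, keys, values)
```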
For the method to work, the researchers found that the positional encoding of each token must stay the same, even as new tokens are added and older ones are evicted. They also found that keeping four attention sink tokens at the start of the sliding cache yields the best results.
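To make that layout concrete, here is a toy Python function (hypothetical, not the authors' code) showing which token positions would remain once a cache with four attention sinks overflows, with each surviving token keeping the positional encoding it was originally assigned, as described above.

```python
def retained_positions(total_tokens, capacity, num_sinks=4):
    """Which token positions survive in the cache once it overflows."""
    if total_tokens <= capacity:
        return list(range(total_tokens))
    sinks = list(range(num_sinks))                 # pinned attention sinks
    window = list(range(total_tokens - (capacity - num_sinks), total_tokens))
    return sinks + window                          # sinks + most recent tokens

# With a 10-slot cache and 4 attention sinks, after 25 tokens the cache holds
# positions [0, 1, 2, 3, 19, 20, 21, 22, 23, 24].
print(retained_positions(25, capacity=10))
```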
StreamingLLM has already proved to be significantly more cost-effective and faster than other methods. For example, at a cache size of 4,096 tokens, a recomputation-based method took 1,411 milliseconds to process a new token, while StreamingLLM needed only 65 milliseconds. The method has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM.
Despite these gains, the model cannot remember words that have been evicted from the cache; the researchers plan to address this limitation in future work.