A group of researchers from MIT and other institutions has pinpointed a key cause of performance degradation in AI chatbots during long conversations and has developed a simple fix. Large language models, such as those behind ChatGPT, store recent data in a key-value cache. When the cache must hold more information than it has capacity for, the oldest data is typically evicted first, which can cause the model's performance to drop.
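To make that eviction behavior concrete, here is a minimal Python sketch of a fixed-capacity key-value cache with oldest-first eviction. The class and parameter names are illustrative, not taken from any real LLM library.

```python
from collections import deque

class NaiveKVCache:
    """Fixed-capacity cache that evicts the oldest token first (FIFO)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # each entry: (token_id, key_vector, value_vector)

    def append(self, token_id, key, value):
        if len(self.entries) == self.capacity:
            self.entries.popleft()  # the earliest tokens of the conversation go first
        self.entries.append((token_id, key, value))
```

Once the conversation outgrows the capacity, the very first tokens are the first to be dropped, which is exactly the failure mode the researchers set out to fix.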
The researchers’ method, named StreamingLLM, keeps a few initial data points in memory, enabling the chatbot to function consistently even during prolonged dialogues. StreamingLLM ran 22 times faster than another method that avoids crashing by constantly recomputing parts of past conversations. This paves the way for AI assistants to handle tasks such as copywriting, editing, or generating code without needing to be constantly restarted.
Large language models slow down after a certain point because the growing cache produces an ever-larger “attention map.” An attention map encodes how strongly each piece of data in the cache relates to every other piece, and this feature underpins the model's ability to generate human-like text. The researchers discovered that if the first token is kept in the sliding cache, performance remains consistent even after the cache exceeds its capacity. They called this first token an “attention sink,” and found that it must always remain in the cache to maintain the model's dynamics.
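A minimal sketch of that idea, again with illustrative names: the eviction rule below pins the first cache slot so the attention sink is never dropped, and slides the window over everything after it.

```python
from collections import deque

class SinkKVCache:
    """Sliding cache that never evicts the first token (the attention sink)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # each entry: (token_id, key_vector, value_vector)

    def append(self, token_id, key, value):
        if len(self.entries) == self.capacity:
            del self.entries[1]  # drop the oldest non-sink token; slot 0 is the sink
        self.entries.append((token_id, key, value))
```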
For optimal performance, the researchers found it best to begin the sliding cache with four attention-sink tokens. They also found that the positional encoding of each token must stay constant, regardless of any changes in the cache. By combining these two techniques, StreamingLLM can sustain long conversations while outperforming a popular method that relies on recomputation.
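Combining the two techniques, a hedged sketch might look like the following. The default num_sinks=4 reflects the finding above; assigning positions by slot within the cache, rather than by a token's place in the original text, is one way to keep positional encodings consistent as the cache changes, in line with how the StreamingLLM paper describes it. All names here are illustrative.

```python
class StreamingKVCache:
    """Sliding cache that pins the first num_sinks tokens and derives
    positional indices from cache slots, not original text positions."""

    def __init__(self, capacity, num_sinks=4):
        assert capacity > num_sinks, "cache must be larger than the sink block"
        self.capacity = capacity
        self.num_sinks = num_sinks
        self.entries = []  # each entry: (token_id, key_vector, value_vector)

    def append(self, token_id, key, value):
        if len(self.entries) == self.capacity:
            # Evict the oldest token that is *not* an attention sink.
            del self.entries[self.num_sinks]
        self.entries.append((token_id, key, value))

    def positions(self):
        # Positions are assigned by slot within the cache, so they stay
        # stable even as tokens are added and evicted.
        return list(range(len(self.entries)))

# Usage: stream 20 tokens through a cache of capacity 8.
cache = StreamingKVCache(capacity=8)
for t in range(20):
    cache.append(t, key=None, value=None)
print([e[0] for e in cache.entries])  # -> [0, 1, 2, 3, 16, 17, 18, 19]
```

After the stream, the cache holds the four sink tokens plus the most recent window, which is the shape StreamingLLM maintains throughout a long conversation.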
In the future, the researchers plan to address a limitation of the method, namely that the model cannot remember words once they are removed from the cache, by investigating ways to retrieve evicted tokens and to enable the model to memorize previous dialogues. StreamingLLM has already been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. This work was funded, in part, by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.