The large language models that drive chatbot technologies such as ChatGPT can struggle to cope with long, continuous dialogues, often leading to a decline in performance. Now, a team of researchers from MIT and elsewhere believes it has found a solution to this issue, one that lets a chatbot continue a conversation without crashing or slowing down.
The solution, named StreamingLLM, makes a minor adjustment to the key-value cache (a form of conversation memory) that lies at the heart of many large language models. In many existing methods, the earliest pieces of data are evicted once the cache exceeds its capacity, which can cause the model to fail.
By guaranteeing that this initial data stays in memory, however, the researchers have shown that a chatbot can maintain a conversation no matter how long it runs. The new method keeps the model efficient even when a conversation stretches past 4 million words, and it is more than 22 times faster than an alternative approach that avoids crashing by repeatedly recomputing parts of earlier conversation.
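The eviction policy described above can be sketched in a few lines. This is a minimal illustration, not the researchers' implementation: the function name and parameters are hypothetical, and the default of four retained initial tokens reflects the setting the article reports worked best.

```python
def streaming_cache_keep(num_tokens, cache_size, num_sinks=4):
    """Return the token positions a StreamingLLM-style cache would retain.

    Hypothetical helper for illustration: instead of evicting the oldest
    tokens outright, it always keeps the first `num_sinks` tokens (the
    "attention sinks") plus the most recent tokens, evicting the middle.
    """
    if num_tokens <= cache_size:
        # Everything still fits; nothing is evicted.
        return list(range(num_tokens))
    num_recent = cache_size - num_sinks
    # First tokens stay pinned; the rest of the budget goes to recent tokens.
    return list(range(num_sinks)) + list(range(num_tokens - num_recent, num_tokens))

# With a 6-slot cache over 10 tokens, positions 0-3 stay pinned and
# only the 2 most recent tokens (8 and 9) fill the remaining slots.
print(streaming_cache_keep(10, 6))  # [0, 1, 2, 3, 8, 9]
```

Because the cache holds a fixed number of entries regardless of conversation length, memory use and per-step cost stay constant, which is what allows the model to keep running over millions of words.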
This progress could pave the way for chatbots to conduct extended conversations throughout the working day without needing regular restarts, providing effective AI support for tasks such as copywriting, editing, or coding.
The researchers had assumed that the first word of a dialogue would have little bearing on the last, so they were surprised to discover the critical role the first token plays. They call it an "attention sink": because the softmax operation normalizes attention scores so they sum to one, any leftover attention has to land somewhere, and the model deposits it on the earliest tokens. The researchers found that keeping four attention sinks at the start of the sliding cache led to the best performance.
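The softmax constraint behind the attention-sink behavior can be seen directly. A small sketch (the scores below are made-up values for illustration): even when a query token is barely related to any earlier token, softmax still distributes a full unit of attention across them, so the surplus must be absorbed somewhere, and trained models tend to dump it on the first tokens.

```python
import math

def softmax(scores):
    # Softmax exponentiates each score and normalizes, so the resulting
    # attention weights always sum to exactly 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores from one query to five earlier tokens.
# All scores are near zero (weak relationships), yet the weights below
# still sum to 1 -- that mandatory unit of attention is what the first
# token, the "attention sink", ends up absorbing in practice.
weights = softmax([0.1, 0.0, 0.0, 0.0, 0.0])
print(round(sum(weights), 6))  # 1.0
```

This is why evicting the first token is so damaging: once the sink is gone, that surplus attention spills onto tokens that were never trained to receive it, degrading the model's output.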
In the future, the researchers plan to address a remaining limitation: the model cannot remember words that are no longer stored in the cache. They aim to investigate methods for retrieving evicted tokens, or for enabling the model to recall previous dialogues. Nvidia has already integrated the new approach into TensorRT-LLM, its large language model optimization library.
This work was funded, in part, by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation. Machine learning and computer science experts from leading institutions have lauded the method and its potential to drive transformative AI applications.