A team of researchers from MIT and other institutions has found a way to prevent chatbots driven by large language models from collapsing during lengthy conversations. The failure typically occurs when the key-value cache, which acts as a kind of "conversation memory," fills beyond its capacity; in some methods the earliest data points are then evicted, which can cause the model to fail.
The researchers developed a method called StreamingLLM that keeps these initial data points in memory, allowing the chatbot to continue the conversation no matter how long it runs. Compared with another technique that avoids crashes by recomputing portions of the past conversation, StreamingLLM ran more than 22 times faster, so it can sustain long chats without needing constant restarts. This could enable efficient AI assistants for tasks such as copywriting, editing, or code generation.
Large language models generally encode data into tokens and use an attention mechanism to generate new text from those tokens. The chatbot writes new text based on the text it has recently seen, building an "attention map" that charts how strongly each token relates to every other token. However, once the number of stored tokens exceeds the cache capacity, the model's performance drops significantly.
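The sketch below illustrates what such an attention map looks like in the simplest case. It uses toy random vectors in place of a real model's learned queries and keys, and it omits details such as causal masking, so the numbers are purely illustrative.

```python
# Minimal sketch of an attention map: softmax-normalized scores between
# every pair of cached tokens. Toy random vectors, not real model weights.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 8                      # 6 cached tokens, 8-dim vectors
Q = rng.normal(size=(n_tokens, d))      # a query vector per token
K = rng.normal(size=(n_tokens, d))      # a key vector per token

scores = Q @ K.T / np.sqrt(d)           # raw relatedness of each token pair
attention_map = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Each row is one token's view of the others and sums to 1.
print(attention_map.round(2))
print(attention_map.sum(axis=1))        # -> all (approximately) ones
```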
One common response to this problem is a "sliding cache" that evicts the oldest tokens to make room for new ones, but the researchers found that the system's performance plummets as soon as that very first token is displaced. They also observed that if the first token is kept in the sliding cache, the model maintains its performance even after the cache limit is exceeded.
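The contrast can be seen in a few lines of code. The sketch below is a hedged illustration of the two cache policies described above; the function names and sizes are assumptions for demonstration, not the authors' implementation.

```python
# Contrast a plain sliding cache (first token eventually evicted) with a
# cache that pins the first token and slides over the rest.
from collections import deque

def plain_sliding_cache(tokens, capacity):
    """Keep only the most recent `capacity` tokens."""
    cache = deque(maxlen=capacity)
    for t in tokens:
        cache.append(t)
    return list(cache)

def first_token_retaining_cache(tokens, capacity):
    """Pin the very first token and slide a window over the rest."""
    first, rest = tokens[0], tokens[1:]
    window = deque(maxlen=capacity - 1)
    for t in rest:
        window.append(t)
    return [first] + list(window)

tokens = list(range(10))                      # stand-in token ids 0..9
print(plain_sliding_cache(tokens, 4))         # [6, 7, 8, 9] -> token 0 gone
print(first_token_retaining_cache(tokens, 4)) # [0, 7, 8, 9] -> token 0 kept
```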
This seems counterintuitive, since the first word of a conversation is unlikely to have anything to do with the last. The explanation lies in how some models use a softmax operation to assign each token a score representing how strongly it relates to every other token; because these scores must sum to 1, any attention not spent on genuinely relevant tokens has to go somewhere, and the model dumps this residual attention score on the first token, which the researchers term the "attention sink."
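The toy example below shows this normalization effect. The raw scores are made-up numbers rather than real model outputs; they simply illustrate how, once softmax forces the scores to sum to 1, a large share of the attention budget can pool on the first position.

```python
# Softmax forces one query's attention weights over the cached tokens to
# sum to 1, so "leftover" attention must land somewhere.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative raw scores for one query against 6 cached tokens: the first
# token receives a high raw score even though it is not semantically
# related, while most later tokens are only weakly relevant.
raw_scores = np.array([2.5, -1.0, -0.5, -1.2, 0.3, -0.8])
weights = softmax(raw_scores)

print(weights.round(3))   # the first entry absorbs most of the attention mass
print(weights.sum())      # sums to 1 (up to floating-point rounding)
```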
In their method, StreamingLLM, the researchers found that keeping four attention-sink tokens at the beginning of the sliding cache leads to the best performance. They also found that the encoding of each token must remain the same, even as new tokens are added and others are displaced. By combining these two ideas, they enabled StreamingLLM to maintain a continuous conversation while outperforming a popular method that relies on recomputation.
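A minimal sketch of that cache layout is shown below: four sink tokens pinned at the front plus a sliding window of the most recent tokens, with each surviving token keeping the position it was originally assigned. The class and parameter names are assumptions for illustration, not the authors' or TensorRT-LLM's implementation.

```python
# Sketch of a cache that pins four "attention sink" tokens and slides a
# window over the rest, preserving each token's original position.
class SinkCache:
    def __init__(self, n_sinks=4, window=8):
        self.n_sinks = n_sinks
        self.window = window
        self.entries = []            # list of (original_position, token)

    def add(self, position, token):
        self.entries.append((position, token))
        sinks = self.entries[:self.n_sinks]
        rest = self.entries[self.n_sinks:]
        # Evict the oldest non-sink tokens; each surviving token keeps the
        # position it was originally assigned.
        self.entries = sinks + rest[-self.window:]

cache = SinkCache(n_sinks=4, window=4)
for pos in range(20):
    cache.add(pos, f"tok{pos}")

# Positions 0-3 (the sinks) are always retained alongside the newest four.
print([pos for pos, _ in cache.entries])  # -> [0, 1, 2, 3, 16, 17, 18, 19]
```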
While StreamingLLM allows for continuous conversation, the model cannot remember words that are no longer stored in the cache. In the future, the researchers plan to address this limitation by exploring ways to retrieve evicted tokens or to let the model memorize earlier parts of the conversation. StreamingLLM has now been integrated into NVIDIA's large language model optimization library, TensorRT-LLM.