
A team of researchers, including several from the Massachusetts Institute of Technology (MIT), has created a system called StreamingLLM that allows chatbots to maintain ongoing dialogues without suffering from performance issues. The method reworks the model’s key-value cache, a form of memory storage, which commonly causes models to fail once it is overloaded with more information than it can hold.

The researchers’ solution was to ensure that the first few pieces of data remain in the cache, enabling the chatbot to keep conversing no matter how long the dialogue runs. StreamingLLM stays efficient even when a conversation exceeds 4 million words, and it runs more than 22 times faster than an alternative method that avoids system crashes by continually recomputing parts of the past dialogue.

This development could enable chatbots to hold lengthy conversations throughout the day without needing to be continually restarted. Such persistent deployment would be useful for tasks like copywriting, editing, or generating code. The paper’s lead author, Guangxuan Xiao, a graduate student in electrical engineering and computer science (EECS), believes that the ability to maintain ongoing dialogues could open up new applications for chatbots.

The researchers noted that the balance between old and new tokens in the cache is critical to model performance. What surprised them was that the model must keep the very first token, or data point, in memory, even after the cache’s capacity has been exceeded. They traced this to the Softmax operation in the attention mechanism: because attention scores must always sum to one, the model dumps whatever attention it does not need elsewhere onto the first token, which they dubbed the “attention sink.” The team found that keeping up to four tokens at the start of the cache as attention sinks yields optimal performance.
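The intuition behind the attention sink can be sketched in a few lines of NumPy. The scores below are made up for illustration and are not from the researchers’ models: because softmax normalizes attention weights so they sum to one, a query that finds nothing relevant in the cache still has to put its attention somewhere, and a token the model has learned to score highly at position zero ends up absorbing the excess.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical attention scores from one query to five cached tokens.
# Token 0 plays the "attention sink" role: the model gives it a sizeable
# score, so it soaks up attention the query has no better use for.
nothing_relevant = np.array([4.0, 0.1, 0.2, 0.0, 0.1])    # token 0 absorbs most of the weight
something_relevant = np.array([4.0, 0.1, 6.0, 0.0, 0.1])  # token 2 now takes most of the weight

for scores in (nothing_relevant, something_relevant):
    w = softmax(scores)
    print(np.round(w, 3), "sum =", w.sum())  # softmax weights always sum to 1
```

If the sink token were evicted, that leftover attention mass would be smeared across unrelated tokens, which is consistent with the performance collapse the researchers observed when the first tokens left the cache.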

The researchers also found that positional encodings must be handled carefully as new tokens are added and older ones are evicted from the cache: each token is encoded according to its position within the cache rather than its position in the original conversation. For example, if token 5 is bumped out, the token that was originally sixth is re-encoded as the fifth token now in the cache.
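A minimal sketch of this bookkeeping is below. The class name, method names, and cache sizes are illustrative rather than taken from the researchers’ code, and real implementations for rotary-position models typically store keys before the positional rotation and re-apply it at the cache-relative positions.

```python
from collections import deque

class StreamingKVCache:
    """Toy rolling cache: keep the first few 'attention sink' tokens plus a
    sliding window of the most recent tokens, evicting everything in between."""

    def __init__(self, num_sinks=4, window_size=1020):
        self.num_sinks = num_sinks
        self.window_size = window_size
        self.sinks = []        # first tokens, never evicted
        self.window = deque()  # most recent tokens

    def append(self, token_kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_kv)
        else:
            self.window.append(token_kv)
            if len(self.window) > self.window_size:
                self.window.popleft()  # evict the oldest non-sink token

    def tokens_and_positions(self):
        # Positions are assigned by index within the cache, not by the token's
        # original position in the conversation, so they stay contiguous
        # (0, 1, 2, ...) after evictions.
        cached = self.sinks + list(self.window)
        return [(pos, kv) for pos, kv in enumerate(cached)]


# Usage: stream ten tokens through a tiny cache (2 sinks + window of 4).
cache = StreamingKVCache(num_sinks=2, window_size=4)
for t in range(10):
    cache.append(f"kv_{t}")
print(cache.tokens_and_positions())
# -> sinks kv_0 and kv_1 stay; of the rest, only kv_6..kv_9 remain,
#    and all cached entries are numbered 0..5 by their place in the cache.
```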

In tests against methods that rely on recomputation, StreamingLLM maintained a continuous conversation while performing better. For instance, when the cache held 256 tokens, StreamingLLM decoded new tokens roughly twice as fast as a recomputation method, and its advantage grew substantially when the cache held 4,096 tokens, with StreamingLLM decoding each new token far faster.

However, StreamingLLM cannot remember words once they have been evicted from the cache, a limitation the researchers plan to address in future work. Despite this, StreamingLLM has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. Other researchers have also recognized StreamingLLM’s potential, citing its ability to process text of up to 4 million tokens and to help deploy large models on devices such as iPhones.
