Researchers from MIT and other institutions have devised an innovative solution to prevent chatbots from crashing during prolonged dialogues. The method, known as StreamingLLM, makes a simple adjustment to the key-value cache, essentially the ‘conversation memory’ at the core of many large language models. By ensuring that the first few data points are never bumped out of this memory, the chatbot can sustain long-running dialogues. The researchers found that StreamingLLM kept a model efficient even in conversations exceeding 4 million words.

In tests, StreamingLLM ran more than 22 times faster than another method that prevents crashing by repeatedly recomputing part of the past conversation. This advancement could allow a chatbot to carry on lengthy discussions throughout a workday without requiring regular reboots, enabling more efficient AI assistants for tasks such as copywriting, programming, or editing.

Large language models transform data, such as the words in a user request, into tokens. Many models use an ‘attention mechanism’ that operates on these tokens to generate new text: the chatbot writes new text based on the text it has recently seen, storing recent tokens in a memory known as a KV (key-value) cache. Once the cache fills to capacity, the model’s performance drops significantly.
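To make the mechanics concrete, here is a minimal Python sketch of that flow: text is split into tokens, and a key-value entry for every processed token is appended to a cache that grows without bound. The toy tokenizer, placeholder vectors, and variable names are illustrative assumptions, not the researchers’ code.

```python
# Hedged sketch of the flow described above: text becomes tokens, and a
# key-value entry for every processed token is appended to a cache.
# The toy tokenizer and placeholder vectors are assumptions for illustration.
def tokenize(text: str) -> list[str]:
    return text.split()                      # stand-in for a real tokenizer

kv_cache: list[tuple[list[float], list[float]]] = []

def process(token: str) -> None:
    key = [float(ord(c)) for c in token]     # placeholder "key" vector
    value = [float(ord(c)) for c in token]   # placeholder "value" vector
    kv_cache.append((key, value))

for tok in tokenize("the chatbot writes new text based on recent text"):
    process(tok)

# Every token ever seen stays in the cache; once it outgrows the model's
# capacity, generation slows and quality drops, as described above.
print(len(kv_cache))                         # 9 entries, one per token
```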

To solve this issue, researchers use a ‘sliding cache’ that evicts the oldest tokens to make room for new ones. However, the model’s performance often plummets as soon as the first token is removed, reducing the quality of newly generated words. The team found that keeping the first token in the sliding cache allows the model to maintain its performance even when the cache exceeds its capacity.
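The sketch below shows one way a sliding cache might evict tokens, plus a variant that pins the very first token in place as described. The function and variable names are assumptions for illustration, not the team’s implementation.

```python
# Hedged sketch of a "sliding" KV cache: when the cache is full, the oldest
# entry is evicted to make room. The keep_first variant pins the very first
# token, mirroring the observation described above.
from collections import deque

def slide(cache: deque, new_token, max_size: int, keep_first: bool = False):
    """Append new_token, evicting an old entry if the cache is full."""
    if len(cache) >= max_size:
        if keep_first:
            # Evict the *second*-oldest entry so the first token stays put.
            first = cache.popleft()
            cache.popleft()          # drop the oldest non-pinned entry
            cache.appendleft(first)
        else:
            cache.popleft()          # plain sliding window: drop the oldest
    cache.append(new_token)

window = deque(["tok0", "tok1", "tok2", "tok3"])
slide(window, "tok4", max_size=4, keep_first=True)
print(list(window))                  # ['tok0', 'tok2', 'tok3', 'tok4']
```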

The researchers attribute this phenomenon to what they call ‘attention sinks.’ These models assign each token a score indicating how strongly it relates to every other token. Because most tokens are not strongly related, their attention scores are low, and any leftover attention is dumped into the first token, known as the ‘attention sink.’ Keeping this attention sink in the cache is essential for maintaining the model’s dynamics.
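A toy calculation suggests why evicting the sink is disruptive: attention weights come from a softmax, so they always sum to 1, and a model that has learned to park its leftover attention on the first token has that mass redistributed the moment the token disappears. The scores below are made-up numbers chosen only to illustrate the effect.

```python
# Toy illustration (an assumption-laden sketch, not a measurement): attention
# weights are softmax-normalized, so they must sum to 1. If a model has learned
# to park its "leftover" attention on the first token, removing that token
# forces the mass onto tokens that were never meant to receive it.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw attention scores for one query: the first token acts as the sink.
scores = [4.0, 0.5, 0.3, 0.4, 0.2]
print([round(w, 2) for w in softmax(scores)])
# -> roughly [0.91, 0.03, 0.02, 0.02, 0.02]  (most mass parked on the sink)

# Evict the first token and renormalize: the remaining weights inflate
# roughly tenfold, distorting the mixture of values the model attends to.
print([round(w, 2) for w in softmax(scores[1:])])
# -> roughly [0.29, 0.24, 0.26, 0.21]
```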

When developing StreamingLLM, the researchers found that placing four attention-sink tokens at the beginning of the sliding cache yields optimal performance. They also observed that the positional encoding of each token must remain the same, even as new tokens are introduced and older ones are evicted.
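Putting the pieces together, the following sketch shows a plausible StreamingLLM-style cache layout: a handful of pinned sink tokens followed by a sliding window of recent tokens. The class name, the window size, and the handling of positions are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of a StreamingLLM-style cache layout, assuming a fixed
# number of "attention sink" tokens followed by a sliding window of recent
# tokens. Names and sizes are illustrative, not the released API.
from collections import deque

NUM_SINKS = 4          # the first four tokens are never evicted
WINDOW = 8             # how many recent tokens to retain (illustrative value)

class StreamingCache:
    def __init__(self, num_sinks: int = NUM_SINKS, window: int = WINDOW):
        self.sinks: list[str] = []                  # permanently pinned tokens
        self.recent: deque = deque(maxlen=window)   # oldest recent token auto-evicts
        self.num_sinks = num_sinks

    def add(self, token: str) -> None:
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token)         # fill the sink slots first
        else:
            self.recent.append(token)        # then slide over recent tokens

    def contents(self) -> list[str]:
        # Positional encodings must be applied consistently to this combined
        # sequence, as noted above; the exact scheme is not reproduced here.
        return self.sinks + list(self.recent)

cache = StreamingCache()
for i in range(20):
    cache.add(f"tok{i}")
print(cache.contents())
# -> ['tok0', 'tok1', 'tok2', 'tok3', 'tok12', 'tok13', ..., 'tok19']
```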

Combining these two ideas allows StreamingLLM to maintain a continuous conversation while outperforming an existing method that relies on recomputation. StreamingLLM also maintains constant memory usage and performance, even when processing texts of up to 4 million tokens in length.

Notably, while StreamingLLM enables continuous conversation, the model cannot remember words that are no longer in the cache. The researchers are working to overcome this limitation with methods that retrieve evicted tokens or enable the model to memorize previous dialogues.

Currently, StreamingLLM has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. The research was funded partially by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.
