Researchers from MIT and other institutions have developed a method, called StreamingLLM, that enables a chatbot to carry on an unbroken conversation without crashing or slowing down. The method involves a tweak to the key-value cache, which acts as a kind of conversation memory for the model. The team found that when this cache exceeded its capacity, the earliest pieces of data were evicted, causing the model's performance to collapse. By keeping those initial data points in memory, the chatbot can keep talking no matter how long the conversation runs.

StreamingLLM remained efficient even when a conversation stretched past 4 million words, and it was more than 22 times faster than an alternative method that avoids crashing by constantly recomputing parts of the past conversation. This efficiency could let a chatbot run all day without needing to be rebooted, enabling effective AI assistants for tasks such as code generation, copywriting, and editing.

The research focused on the model's key-value (KV) cache, which stores representations of recent tokens that are reused when generating new text. An attention map records how strongly each token relates to every other token, and this map starts to break down once the cache overflows: when the number of tokens needed to encode the conversation exceeds the cache's capacity, the earliest tokens are evicted and the model's performance drops. To address this, the researchers keep the first token, which acts as an "attention sink", in the sliding cache, enabling the model to maintain its performance even when the cache is full.
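To make the idea concrete, here is a minimal Python sketch of a sliding cache that never evicts its first few tokens. The class name SinkKVCache and its parameters num_sinks and window are illustrative, not part of the StreamingLLM code, and a real KV cache would hold per-layer key and value tensors rather than bare token entries.

```python
from collections import deque

class SinkKVCache:
    """Illustrative sliding KV cache that always retains the first
    `num_sinks` tokens (the "attention sinks") and keeps only the most
    recent `window` tokens after them."""

    def __init__(self, num_sinks: int = 4, window: int = 252):
        self.num_sinks = num_sinks
        self.sinks = []                      # first tokens, never evicted
        self.recent = deque(maxlen=window)   # rolling window of recent tokens

    def append(self, kv):
        # Fill the sink slots first; after that, new entries go into the
        # rolling window, which silently drops its oldest entry when full.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def tokens(self):
        # What the model attends over: the sinks followed by recent tokens.
        return self.sinks + list(self.recent)
```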

With StreamingLLM, the researchers found that placing four attention sink tokens at the beginning of the sliding cache led to optimal performance. They also discovered that each token's positional encoding must remain the same, even as new tokens are added and others are removed.
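Assuming the hypothetical SinkKVCache sketched above, a short usage example shows that four-sink configuration in action: the first four tokens stay in the cache while the rest of the window slides.

```python
# Illustrative only: exercise the hypothetical SinkKVCache with four sinks
# and a small window so the sliding behavior is easy to see.
cache = SinkKVCache(num_sinks=4, window=8)

# Feed in 20 token ids; the cache keeps the first 4 plus the latest 8.
for token_id in range(20):
    cache.append(token_id)

print(cache.tokens())
# -> [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```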

For example, when the cache holds 256 tokens, StreamingLLM takes 31 milliseconds to decode a new token, compared with 63 milliseconds for the recomputation method. If the cache grows to 4,096 tokens, recomputation needs 1,411 milliseconds per new token, whereas StreamingLLM needs just 65 milliseconds.

However, while StreamingLLM enables a model to conduct a continuous conversation, the model cannot remember words that are no longer stored in the cache. In future work, the researchers plan to address this limitation, perhaps by finding ways to retrieve past tokens or by enabling the model to memorize previous conversations. StreamingLLM has been incorporated into NVIDIA's large language model optimization library, TensorRT-LLM. This research was funded, in part, by the U.S. National Science Foundation, the MIT Science Hub, and the MIT-IBM Watson AI Lab.
