Researchers from MIT and other institutions have developed a solution that keeps human-AI conversations going without the chatbot crashing or slowing down. The method, known as StreamingLLM, tweaks the key-value cache (which acts like a conversation memory) at the heart of many large language models. In the conventional setup, once the cache is filled beyond capacity, the earliest tokens are evicted, which can cause the chatbot to fail. StreamingLLM instead keeps those initial tokens in memory, allowing the chatbot to keep conversing no matter how long the exchange runs.
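To make the failure mode concrete, here is a minimal, hypothetical Python sketch (not the researchers' code) of a fixed-capacity key-value cache that evicts its oldest entries first; the class name and the string stand-ins for per-token key/value pairs are illustrative only.

```python
# Hypothetical sketch of the conventional setup: a FIFO sliding-window KV cache
# that drops the earliest tokens once it reaches capacity.
from collections import deque

class NaiveKVCache:
    """FIFO sliding-window cache: once full, the earliest tokens are dropped."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()          # each entry stands in for one token's key/value pair

    def append(self, kv):
        if len(self.entries) == self.capacity:
            self.entries.popleft()      # the very first tokens are evicted here,
                                        # which is where generation quality collapses
        self.entries.append(kv)

cache = NaiveKVCache(capacity=4)
for token_id in range(6):
    cache.append(f"kv_{token_id}")
print(list(cache.entries))              # ['kv_2', 'kv_3', 'kv_4', 'kv_5'] -- kv_0 and kv_1 are gone
```

Once the loop pushes past the cache's capacity, the entries for the very first tokens disappear, which is the point at which a conventional model starts to degrade.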

StreamingLLM remains efficient even in extremely long conversations, stretching past 4 million words, and proved more than 22 times faster than an alternative approach that avoids failure by constantly recomputing parts of the past conversation. This advance could enable uninterrupted, day-long conversations with a chatbot, with possible applications such as AI assistants for copywriting, editing, and code generation.

The research highlights the crucial role of the first few tokens. Under a “sliding cache” approach, the oldest tokens are evicted as fresh tokens are added to a cache nearing its size limit, and model performance drops sharply the moment the first token is removed. If that first token is retained, however, the model keeps performing well even when the cache exceeds its limit. The researchers call these retained initial tokens “attention sinks”: because attention scores must sum to one, the model offloads leftover attention onto the earliest tokens, so keeping them in the cache preserves the attention pattern the model relies on. They found that four attention sink tokens at the start of the sliding cache gave the best performance.
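Below is a hedged sketch of the retention policy just described, under the simplifying assumption that per-token key/value tensors can be treated as opaque entries in a Python list; the class and variable names are illustrative, not the StreamingLLM API.

```python
# Sketch of attention-sink retention: keep the first few "sink" tokens pinned,
# and slide the eviction window over everything that comes after them.
class SinkKVCache:
    def __init__(self, capacity: int, num_sinks: int = 4):
        assert num_sinks < capacity
        self.capacity = capacity
        self.num_sinks = num_sinks
        self.entries = []               # one entry per cached token

    def append(self, kv):
        if len(self.entries) == self.capacity:
            # Evict the oldest non-sink token instead of the first token overall.
            del self.entries[self.num_sinks]
        self.entries.append(kv)

cache = SinkKVCache(capacity=8, num_sinks=4)
for token_id in range(12):
    cache.append(f"kv_{token_id}")
print(cache.entries)
# ['kv_0', 'kv_1', 'kv_2', 'kv_3', 'kv_8', 'kv_9', 'kv_10', 'kv_11']
```

The actual method also assigns positional information relative to positions inside the cache rather than in the full text, a detail this sketch omits.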

StreamingLLM outperformed the recomputation method across a range of cache sizes during testing. With a cache of 256 tokens, StreamingLLM needed 31 milliseconds to decode a new token, versus 63 milliseconds for recomputation. With a larger cache of 4,096 tokens, StreamingLLM took 65 milliseconds, far quicker than the 1,411 milliseconds required by recomputation.

The researchers also explored the training implications of attention sinks by prepending several placeholder tokens to all training samples. They found that a model trained this way maintains performance with only one attention sink in its cache, rather than the four typically needed.
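As an illustration of that training recipe, the sketch below prepends dedicated placeholder tokens to each tokenized training sample; the token name, IDs, and helper function are hypothetical, not drawn from the paper's code.

```python
# Hypothetical sketch of adding dedicated placeholder ("sink") tokens to every
# pretraining sample so the model learns to use them as attention sinks.
SINK_TOKEN = "<sink>"                      # illustrative name; not a standard token
SINK_ID = 0                                # made-up vocabulary id for the sink token

def prepend_sink_tokens(token_ids, sink_id=SINK_ID, num_sinks=1):
    """Prepend num_sinks copies of the placeholder token id to one tokenized sample."""
    return [sink_id] * num_sinks + list(token_ids)

# toy usage with made-up token ids
samples = [[17, 42, 99], [5, 8]]
samples_with_sink = [prepend_sink_tokens(s) for s in samples]
print(samples_with_sink)                   # [[0, 17, 42, 99], [0, 5, 8]]
```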

However, while StreamingLLM can facilitate uninterrupted conversations, current models can’t recall words that aren’t kept in the cache. Hence, future work will investigate methods to access evicted tokens or enable the AI to remember previous interactions. StreamingLLM has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM. The study was partially funded by the U.S. National Science Foundation, the MIT-IBM Watson AI Lab, and the MIT Science Hub.
