Researchers from MIT and other institutions have identified why AI chatbot conversations can break down and developed a solution that enables continuous dialogue. The issue lies in the chatbot’s key-value cache, which acts as a kind of conversational memory. In some models, the earliest stored data is discarded when the cache reaches its limit, causing the bot to fail. The researchers’ solution, StreamingLLM, keeps those first pieces of data in memory, allowing chat sessions to continue indefinitely.
StreamingLLM helps a model perform at its best even in long conversations exceeding four million words. It also proved far more efficient than another method that avoids crashing by continuously recomputing portions of past dialogue, running more than 22 times faster. This breakthrough could let AI chatbots hold day-long conversations without requiring frequent reboots, making them more effective assistants for tasks such as copywriting, editing, or generating code.
The researchers explain that large language models encode data, such as the words in a user query, into tokens, and an attention mechanism uses these tokens to build an “attention map” that measures how strongly each token relates to every other token. When the cache grows too large, this attention map slows computation, and performance degrades if encoding the content requires more tokens than the cache can hold.
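To make that concrete, here is a minimal sketch in Python (with invented embeddings and dimensions; not the researchers’ code) of how an attention map scores a new token against every token already held in the key-value cache, which is why the cost of each decoding step grows with the size of the cache.

```python
import numpy as np

def attention_map(queries, keys):
    """Score how strongly each query token attends to every cached key token.

    queries: (n_new, d) array of query vectors for the newest tokens
    keys:    (n_cached, d) array of key vectors held in the key-value cache
    Returns an (n_new, n_cached) attention map of softmax weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # similarity of each query to each cached key
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: one new token attending over a cache of 6 earlier tokens.
rng = np.random.default_rng(0)
cache_keys = rng.normal(size=(6, 8))   # the cache grows with every generated token
new_query = rng.normal(size=(1, 8))
print(attention_map(new_query, cache_keys).shape)   # (1, 6) -- one weight per cached token
```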
Researchers have previously used a “sliding cache” approach, which replaces the oldest tokens with new ones as the cache fills. However, this strategy can cause a substantial drop in performance once the first token is evicted. The team discovered that keeping that first token, which they call the “attention sink,” in the sliding cache maintains model performance even when the cache limit is exceeded. They also found that, for optimal performance, the streaming cache should keep four attention sink tokens at its start.
Furthermore, they found that each token’s positional encoding must remain constant, even as the cache changes. For example, token 6 must keep its number even after token 5 is evicted and token 6 becomes the fifth token in the cache.
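Taken together, the two ideas amount to a simple cache-management policy. The sketch below is a hypothetical Python illustration, not the authors’ implementation: the names StreamingCache, NUM_SINKS, and CACHE_LIMIT are invented, the cache limit is kept tiny for readability, and, following the description above, each cached token keeps its original position label after earlier tokens are evicted.

```python
from collections import deque

NUM_SINKS = 4      # the four attention-sink tokens kept at the start of the cache
CACHE_LIMIT = 8    # tiny limit for illustration; real caches hold thousands of tokens

class StreamingCache:
    """Sliding key-value cache that always retains the attention-sink tokens."""

    def __init__(self):
        self.sinks = []          # (position, token) pairs that are never evicted
        self.window = deque()    # (position, token) pairs for the most recent tokens

    def add(self, position, token):
        if len(self.sinks) < NUM_SINKS:
            self.sinks.append((position, token))
            return
        if len(self.sinks) + len(self.window) >= CACHE_LIMIT:
            self.window.popleft()   # evict the oldest non-sink token
        # The token keeps its original position label even after earlier
        # tokens have been evicted (per the description above).
        self.window.append((position, token))

    def contents(self):
        return self.sinks + list(self.window)

cache = StreamingCache()
for pos, tok in enumerate(["<s>", "Hello", ",", "how", "are", "you", "doing", "today", "my", "friend"]):
    cache.add(pos, tok)
print(cache.contents())
# The first four tokens survive; the oldest mid-conversation tokens were evicted,
# and each remaining token still carries its original position number.
```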
By combining these two ideas, StreamingLLM can sustain an effectively never-ending conversation while outperforming a popular method that relies on recomputation. For instance, with a 256-token cache, StreamingLLM took only 31 milliseconds to decode a new token, compared to 63 milliseconds for the recomputation method. At a cache size of 4,096 tokens, StreamingLLM required just 65 milliseconds, while recomputation needed 1,411 milliseconds.
Despite these advances, StreamingLLM cannot remember words once they fall outside the cache. In future research, the team aims to address this shortcoming and allow the model to recall previous conversations. StreamingLLM has already been integrated into NVIDIA’s large language model optimization library, TensorRT-LLM.