
A team of researchers from MIT and other institutions has developed a method that prevents the performance deterioration AI language models suffer during long, continuous dialogue, such as a conversation with the chatbot ChatGPT. Named StreamingLLM, the solution revolves around a modification to the model's key-value cache, which acts as a conversation memory. Conventionally, when the cache overflows, the earliest data elements are discarded, which can cause the model to fail. The researchers realized that by retaining these initial data points, they could enable the model to keep conversing.

The StreamingLLM method keeps a model efficient even when a dialogue stretches beyond 4 million words. In trials, StreamingLLM was 22 times faster than a competing method that avoids crashing by constantly recomputing parts of the earlier conversation. This improvement could let a chatbot sustain long conversations throughout a workday without continual restarts, making AI assistants more efficient at tasks such as copywriting, editing, and code generation.

Large language models encode the text of a user's query into representations called tokens. The models use an attention mechanism over these tokens to generate new text, storing recent tokens in a memory called the KV Cache for later reuse. The attention mechanism then builds an "attention map" that captures how strongly each token relates to every other token; this map is what lets the model generate human-like text.
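To make the mechanics concrete, here is a minimal Python sketch (not the authors' code) of a single attention step that reuses a KV cache: each new token's key and value are appended to the cache, and one row of the attention map is computed against everything cached so far. All names, shapes, and weights are illustrative assumptions.

```python
# Minimal sketch: how a decoder reuses a KV cache during generation.
import numpy as np

d = 8                      # toy embedding size
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the "KV Cache": keys/values of tokens seen so far

def attend(x):
    """Process one new token embedding x, reusing cached keys/values."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)          # (num_cached_tokens, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)    # one row of the "attention map"
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V             # context vector used to predict the next token

for _ in range(5):                 # feed a few toy token embeddings
    out = attend(rng.standard_normal(d))
print(len(k_cache), "tokens cached")
```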

However, problems arise when the cache grows too large, which slows computation, or when a prompt requires more tokens than the cache can hold, which hurts performance. To counter these issues, researchers have used a "sliding cache" that evicts the oldest tokens to make room for new ones. Unfortunately, model quality often collapses as soon as the very first token is ejected. The researchers found that keeping that first token in the sliding cache preserves performance even after the cache capacity is exceeded.
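The toy sketch below (an illustration, not the released implementation) contrasts a plain sliding cache with one that keeps the first token. The cache here holds only token ids, whereas a real KV cache stores key/value tensors per token.

```python
def evict_sliding(cache, max_size):
    """Plain sliding window: drop the oldest token once the cache is full."""
    while len(cache) > max_size:
        cache.pop(0)               # the first token is eventually ejected -> quality collapses
    return cache

def evict_keep_first(cache, max_size, num_kept=1):
    """Keep the first token(s) and slide over the rest."""
    while len(cache) > max_size:
        cache.pop(num_kept)        # evict the oldest *non-initial* token instead
    return cache

cache = list(range(10))            # tokens 0..9 already cached, capacity is 8
print(evict_sliding(cache.copy(), 8))     # [2, 3, 4, 5, 6, 7, 8, 9]
print(evict_keep_first(cache.copy(), 8))  # [0, 3, 4, 5, 6, 7, 8, 9]
```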

The researchers call this first token an "attention sink." They found that keeping four attention sinks at the start of the sliding cache gives optimal performance. They also found that the positional encoding of each token must remain consistent, even as new tokens are added and others are evicted. By combining these two ideas, StreamingLLM sustains an ongoing conversation and outperforms popular methods that rely on recomputation.
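Here is a hedged sketch of how the two ideas might fit together: a cache that always keeps four sink tokens plus a recent window, with positions assigned by slot in the cache (the cache-relative scheme described in the StreamingLLM paper). The constants and function names are illustrative, not the actual implementation.

```python
NUM_SINKS = 4          # the first four tokens act as attention sinks
WINDOW = 6             # recent tokens kept alongside the sinks (toy value)

def update_cache(cache, new_token):
    """Append a token, evicting the oldest non-sink token when full."""
    cache.append(new_token)
    if len(cache) > NUM_SINKS + WINDOW:
        cache.pop(NUM_SINKS)       # drop the oldest token after the sinks
    return cache

def cache_positions(cache):
    """Positions used for positional encoding: assigned by cache slot."""
    return list(range(len(cache)))

cache = []
for token_id in range(15):         # stream 15 toy tokens through the cache
    update_cache(cache, token_id)

print(cache)                       # [0, 1, 2, 3, 9, 10, 11, 12, 13, 14]
print(cache_positions(cache))      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```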

The team also explored using attention sinks during model training by prepending a placeholder token to every training sample. They found that training with attention sinks lets a model maintain performance with only a single attention sink in its cache.
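As a rough illustration of that training-time idea, the snippet below prepends a reserved placeholder token to each pre-tokenized training sample; the token id and data layout are assumptions for the sketch.

```python
SINK_TOKEN_ID = 0                  # hypothetical id reserved for the sink token

def add_sink(token_ids):
    """Prepend the placeholder sink token to one training sample."""
    return [SINK_TOKEN_ID] + token_ids

training_samples = [[17, 42, 99], [5, 8, 13, 21]]   # toy pre-tokenized samples
training_samples = [add_sink(s) for s in training_samples]
print(training_samples)            # [[0, 17, 42, 99], [0, 5, 8, 13, 21]]
```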

A remaining limitation is that the model cannot remember words that are no longer stored in the cache; the researchers plan to address this by developing methods to recall evicted tokens or to let the model memorize previous conversations. NVIDIA has already incorporated StreamingLLM into its large language model optimization library, TensorRT-LLM. This research was partially funded by the MIT-IBM Watson AI Lab, the MIT Science Hub, and the U.S. National Science Foundation.
