
Researchers from MIT and other institutions have developed a method, called StreamingLLM, that enables AI chatbots to maintain continuous dialogue without crashing or slowing down. The technique involves a tweak to the key-value cache, the conversation memory at the core of many large language models. Failure often occurs when this cache must hold more data than it has room for and the earliest entries are evicted to make space. By keeping those first few data points in memory, the method prevents the model from failing. StreamingLLM allows a model to continue a conversation even when it stretches past 4 million words, and it ran more than 22 times faster than a method that avoids crashing by constantly recomputing part of the past conversation. This could enable chatbots to carry on long conversations throughout the day without needing to be continually rebooted.
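To make the caching idea concrete, here is a minimal Python sketch of that eviction policy: the first few entries are pinned permanently while the rest live in a rolling window whose oldest entries are dropped. The class and its names are hypothetical illustrations, not the researchers' implementation.

```python
from collections import deque

class SinkKVCache:
    """Hypothetical sketch of a StreamingLLM-style cache policy: keep the
    first `num_sinks` entries forever plus a sliding window of the most
    recent entries, evicting only from the middle of the stream."""

    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.sinks: list = []                       # earliest tokens, never evicted
        self.recent: deque = deque(maxlen=window)   # recent tokens; deque drops the oldest when full

    def add(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)   # pin the first tokens as attention sinks
        else:
            self.recent.append(kv_entry)  # older window entries fall off the left

    def contents(self):
        return self.sinks + list(self.recent)

cache = SinkKVCache(num_sinks=4, window=8)
for token_id in range(20):
    cache.add(token_id)
print(cache.contents())  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```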

Large language models, such as those behind AI chatbots, encode text into representations called tokens, which they then use to generate new text. When more tokens accumulate than the cache can store, the model's performance suffers. Researchers have traditionally dealt with this using a "sliding cache," which evicts the oldest tokens to make room for new ones, but this can cause performance to fall off rapidly. The new paper found that if the first token is kept in the cache, performance does not degrade. The reason, the researchers discovered, is that the model treats the first token as an "attention sink": any token that is not strongly related to the others dumps its leftover attention there.
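The attention-sink behavior follows from how attention weights are computed: a softmax forces each query's weights over the cached tokens to sum to one, so even a query that matches nothing strongly must place its attention somewhere, and trained models park that excess on the first token. A toy illustration of the underlying constraint (not code from the paper):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy attention scores for one query over five cached tokens.
# The query is only weakly related to every token, yet the softmax
# still distributes a full unit of attention across them.
scores = np.array([0.10, 0.05, 0.00, 0.08, 0.02])
weights = softmax(scores)
print(weights)        # small, near-uniform weights
print(weights.sum())  # 1.0 -- the attention has to go somewhere
```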

In building StreamingLLM, the researchers found that keeping four attention sink tokens at the start of the sliding cache yields optimal performance. They also found that positional encoding must be handled consistently as new tokens are added and others evicted: each token is assigned a position based on its slot within the cache rather than its place in the original text. Combining these two ideas allows StreamingLLM to maintain a continuous conversation while outperforming a popular method that relies on recomputation.
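A small sketch of that positional scheme (function name hypothetical): positions are assigned by slot in the cache, so they stay within a fixed range no matter how long the underlying stream grows.

```python
def cache_relative_positions(cache_token_ids: list[int]) -> list[int]:
    """Assign positions by slot within the cache, not by a token's
    position in the full text, so encodings never exceed the cache length."""
    return list(range(len(cache_token_ids)))

# Cache holds sink tokens 0-3 plus recent tokens 1021-1023 of a long stream;
# the positions used for attention are 0..6, not the original text positions.
print(cache_relative_positions([0, 1, 2, 3, 1021, 1022, 1023]))  # [0, 1, 2, 3, 4, 5, 6]
```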

One limitation remains: the model cannot remember words that are no longer stored in the cache. The researchers plan to address this by investigating methods to retrieve evicted tokens or to enable the model to memorize previous conversations. StreamingLLM has already been incorporated into NVIDIA's large language model optimization library, TensorRT-LLM.
