
MIT researchers have identified a flaw in the way large language models manage their memory that can cause AI chatbots’ performance to deteriorate drastically during long conversations. Essentially, when the data stored in a chatbot’s “memory” (known as the key-value cache) exceeds the cache’s capacity, the earliest data is evicted, which can cause the chatbot to fail or slow down. To counter this, the team created a method called StreamingLLM that keeps those earliest pieces of data in the chatbot’s memory, allowing it to continue the dialogue no matter how long it runs. The team found that keeping four “attention sink” tokens lets StreamingLLM maintain strong performance even during extremely lengthy discussions. They also found that each token’s positional encoding must remain consistent: new data can be added to the cache and older data removed, but the value encoding each token’s place must not change.
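
To make the idea concrete, here is a minimal, illustrative sketch of that eviction scheme: the first few tokens are kept as attention sinks, a rolling window holds the most recent tokens, and everything in between is evicted. The class and parameter names (SinkKVCache, num_sinks, window_size) are hypothetical and are not the API of the actual StreamingLLM or TensorRT-LLM code.

```python
# Conceptual sketch of a key-value cache with "attention sink" tokens.
# Not the researchers' implementation; names and structure are illustrative.

from collections import deque


class SinkKVCache:
    def __init__(self, num_sinks: int = 4, window_size: int = 8):
        self.num_sinks = num_sinks      # earliest tokens, kept permanently
        self.window_size = window_size  # most recent tokens, kept in a rolling window
        self.sinks = []                 # attention-sink entries
        self.window = deque()           # rolling window of recent entries

    def append(self, token_kv):
        """Add one token's key/value entry, evicting the oldest window entry if full."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_kv)  # the very first tokens become the sinks
            return
        if len(self.window) == self.window_size:
            self.window.popleft()        # evict the oldest non-sink token
        self.window.append(token_kv)

    def contents(self):
        """Cached entries paired with stable cache positions (0, 1, 2, ...)."""
        return list(enumerate(self.sinks + list(self.window)))


if __name__ == "__main__":
    cache = SinkKVCache(num_sinks=4, window_size=8)
    for t in range(20):                  # simulate a 20-token conversation stream
        cache.append(f"kv[token {t}]")
    for pos, entry in cache.contents():
        print(pos, entry)
    # Tokens 0-3 survive as sinks, tokens 12-19 fill the window; 4-11 were evicted.
```

Running the sketch shows the key property: no matter how long the stream grows, the cache never exceeds num_sinks + window_size entries, and the sink tokens are never pushed out.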

The new method can handle conversations of more than 4 million words and is more than 22 times faster than a common alternative that avoids crashes by continually recomputing part of the past conversation. This could pave the way for more capable AI assistants that can carry out extended tasks such as copywriting, editing, and coding.

The researchers’ work addresses a significant challenge in deploying large language models in real-world applications. They established that performance declines when the model’s memory cache fills up and the earliest tokens are evicted. Reserving those first few tokens in memory allows the model to keep working well even once the cache overflows. The researchers call these first tokens “attention sinks” and found that they play an essential role in maintaining a model’s performance.

StreamingLLM has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM, representing a significant advance in AI chatbot technology. The method still has a key limitation: the model cannot remember words that have been evicted from the cache. Future research will focus on addressing this by exploring ways to retrieve evicted tokens or to enable the model to memorize previous conversations.

The research, led by Guangxuan Xiao, an electrical engineering and computer science (EECS) graduate student, will be presented at the International Conference on Learning Representations. It was carried out with his advisor, Song Han, a member of the MIT-IBM Watson AI Lab and a distinguished scientist at NVIDIA, along with collaborators from Meta AI and Carnegie Mellon University.
