Researchers from MIT have devised a method called StreamingLLM that enables chatbots to maintain long, uninterrupted dialogues without crashing or slowing down. It involves a modification to the key-value cache at the core of many large language models, which serves as a conversation memory, ensuring that the earliest data points are never evicted. The method lets a chatbot remain functional even when a conversation exceeds 4 million words, and it runs more than 22 times faster than other methods. Potential applications include tasks such as copywriting, editing, and generating code.
Large language models encode data as tokens, and an attention mechanism uses these tokens to generate new text. An ‘attention map’ defines how strongly each token relates to every other token, which is what allows the model to produce human-like text. Problems arise when the store of recent tokens, called the KV (key-value) cache, exceeds its capacity: the oldest tokens are evicted, and the model either crashes or its performance degrades sharply. StreamingLLM sidesteps this issue by retaining the first token – referred to as the ‘attention sink’ – in the sliding cache, preserving performance even once the cache is full.
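To illustrate the idea, here is a minimal sketch in Python of a sliding key-value cache that pins a few initial tokens as attention sinks so they are never evicted; the class name, methods, and parameters are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

class SinkKVCache:
    """Sketch of a bounded KV cache: a few initial "attention sink" tokens
    are pinned, and the rest live in a fixed-size sliding window."""

    def __init__(self, num_sinks: int = 4, window_size: int = 1024):
        self.num_sinks = num_sinks                 # tokens pinned at the start
        self.sinks = []                            # (key, value) pairs for the first tokens
        self.window = deque(maxlen=window_size)    # most recent (key, value) pairs

    def add(self, key, value):
        # The first `num_sinks` tokens become permanent attention sinks;
        # later tokens enter the sliding window, whose oldest entry is
        # dropped automatically once the window is full.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((key, value))
        else:
            self.window.append((key, value))

    def contents(self):
        # Attention is computed over the sinks plus the recent window,
        # so the cache size stays bounded no matter how long the stream gets.
        return self.sinks + list(self.window)


if __name__ == "__main__":
    cache = SinkKVCache(num_sinks=1, window_size=4)
    for t in range(10):
        cache.add(f"k{t}", f"v{t}")
    # Token 0 is kept as the attention sink; only the four most recent follow.
    print([k for k, _ in cache.contents()])  # ['k0', 'k6', 'k7', 'k8', 'k9']
```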
The researchers also examined the use of attention sinks during model training. They found that training with attention sinks allows a model to maintain performance with just a single attention sink in its cache. Despite its benefits, however, StreamingLLM cannot remember words that are no longer stored in the cache, so future work will concentrate on addressing this limitation. In the meantime, StreamingLLM has been incorporated into NVIDIA’s large language model optimization library, TensorRT-LLM.
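As a rough sketch of what training with a dedicated attention sink might look like, the snippet below prepends a placeholder sink token to every training sequence so the model can learn to park excess attention on it; the token id, tensor shapes, and function name are assumptions for illustration, not the authors' training code.

```python
import torch

SINK_TOKEN_ID = 0   # hypothetical id reserved for the learnable sink token
VOCAB_SIZE = 32000

def prepend_sink(batch: torch.Tensor) -> torch.Tensor:
    """Prepend the sink token to every sequence in a (batch, seq_len) id tensor."""
    sink_col = torch.full((batch.size(0), 1), SINK_TOKEN_ID, dtype=batch.dtype)
    return torch.cat([sink_col, batch], dim=1)

# Example: a batch of two token-id sequences gains a leading sink token.
batch = torch.randint(1, VOCAB_SIZE, (2, 8))
print(prepend_sink(batch).shape)  # torch.Size([2, 9])
```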