Language modeling, a key task in machine learning, aims to predict the likelihood of a sequence of words. Used in applications such as text summarization, translation, and auto-completion, it greatly improves the ability of machines to understand and generate human language. However, processing and storing long sequences imposes significant computational and memory costs, hindering real-time processing and scalability.
Traditional approaches to language modeling, most notably the Transformer architecture, have addressed some of these issues. The Transformer's self-attention mechanism lets every token interact with every other token regardless of distance, and decoder-only variants have become the standard for text generation. Sparse Transformers limit interactions between distant positions to reduce computational demand, while encoder-only and encoder-decoder models such as BERT and T5 adapt the architecture to different tasks. Even so, the cost of attention and its per-layer key-value (KV) caches still grows quickly with sequence length.
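To make the later contrast with YOCO concrete, the sketch below shows plain scaled dot-product self-attention, in which every token scores every other token no matter how far apart they are. The shapes and weight names are illustrative assumptions, not taken from any particular implementation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the whole sequence
    return weights @ v                               # distance between tokens plays no special role

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                         # 8 tokens, 16-dimensional embeddings
w_q, w_k, w_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (8, 16)
```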
Researchers from Microsoft Research and Tsinghua University have developed a novel architecture for large language models called You Only Cache Once (YOCO). Unlike standard decoder-only Transformers, which keep a KV cache in every layer, YOCO caches key-value pairs only once, reducing computational overhead and memory use. It processes long sequences efficiently through a decoder-decoder design: a self-decoder that produces a single global KV cache, and a cross-decoder that reuses it.
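The following is a minimal structural sketch, under toy assumptions, of that decoder-decoder idea: the self-decoder layers produce one shared KV cache, and every cross-decoder layer attends to that same cache instead of maintaining its own. The class, layer count, and projections are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class ToyYOCO:
    """Toy decoder-decoder: the KV cache is produced once and shared."""
    def __init__(self, n_layers=8, d=16, seed=0):
        rng = np.random.default_rng(seed)
        self.half = n_layers // 2                              # first half: self-decoder
        self.w_layers = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]
        self.w_kv = rng.normal(size=(d, 2 * d)) * 0.1          # one shared KV projection

    def forward(self, x):
        h = x
        for w in self.w_layers[: self.half]:
            h = np.tanh(h @ w)                                 # self-decoder: cheap mixing (stand-in)
        k, v = np.split(h @ self.w_kv, 2, axis=-1)             # KV pairs are computed once here
        for w in self.w_layers[self.half :]:
            h = h + softmax((h @ w) @ k.T) @ v                 # cross-decoder layers reuse the same cache
        return h, (k, v)

h, (k, v) = ToyYOCO().forward(np.random.default_rng(1).normal(size=(8, 16)))
print(h.shape, k.shape)                                        # (8, 16) (8, 16)
```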
Within YOCO, the self-decoder uses efficient self-attention, such as sliding-window attention or gated retention, to encode the input and produce a compact set of KV pairs. The cross-decoder then reuses these pairs through cross-attention, eliminating the need to recompute and store KV caches in every layer. When evaluated on multiple datasets, YOCO demonstrated improved processing speed and memory efficiency compared with popular Transformer-based models.
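As one concrete example of what the self-decoder's efficient attention can look like, the snippet below builds a causal sliding-window mask: each token attends only to the previous few positions, so the state the self-decoder carries stays bounded by the window size rather than the full sequence length. This is an illustrative construction, not the paper's kernel, and the window size shown is arbitrary.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where attention is allowed: causal, and at most `window` tokens back."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```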
Experimental results show that YOCO achieves near-perfect retrieval accuracy on sequences up to one million tokens long. It greatly reduces the GPU memory demands of 65-billion-parameter models (by approximately 80 times), and prefilling latency for a 512K-token context drops from 180 seconds to under six seconds. YOCO also substantially increases throughput over the traditional Transformer, raising it from 4.5 to 43.1 tokens per second, a 9.6-fold improvement.
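A rough back-of-envelope calculation, using a hypothetical 65B-class configuration rather than the paper's exact accounting, shows why caching KV pairs once scales so much better: a conventional decoder holds a KV cache in every layer, whereas YOCO's cross-decoder layers share a single one (the small sliding-window cache kept by the self-decoder is ignored here).

```python
# Hypothetical configuration for illustration only; not the paper's measured numbers.
n_layers, n_kv_heads, head_dim = 80, 8, 128   # assumed 65B-class model
tokens, bytes_per_elem = 512_000, 2           # 512K-token context, 16-bit precision

per_layer = tokens * n_kv_heads * head_dim * 2 * bytes_per_elem   # keys and values
standard = n_layers * per_layer               # conventional decoder: one KV cache per layer
yoco = per_layer                              # YOCO: cached once, shared by the cross-decoder

print(f"standard: {standard / 2**30:.0f} GiB, YOCO: {yoco / 2**30:.0f} GiB, "
      f"ratio ~{standard / yoco:.0f}x")       # ratio equals the layer count in this toy accounting
# The reported throughput gain is simple arithmetic: 43.1 / 4.5 ≈ 9.6.
```

This arithmetic only conveys the scaling behavior; the measured reductions in the paper also depend on implementation details not modeled here.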
In summary, YOCO represents a significant advancement in language modeling by reducing computational overhead and memory usage. It applies an innovative decoder-decoder framework and efficient attention mechanisms to process long sequences more effectively, achieving almost perfect retrieval accuracy and dramatically lower latency and memory demands. By providing a scalable, efficient solution for deploying large language models, YOCO has significant potential benefits for a range of real-world applications.