Traditional training methods for Large Language Models (LLMs) are constrained by subword tokenization: the amount of raw text a model can process is tied to its token budget, which drives up computational cost and places a ceiling on scalability and on working with very large datasets. Meeting these challenges means finding strategies that condense text far more aggressively for efficient model training while maintaining, and ideally improving, performance.
Previous researchers have attacked these challenges from several directions. Work around the Chinchilla model, for example, demonstrated that transformer language models can serve as powerful data compressors. Arithmetic Coding schemes have been adapted specifically to be LLM-friendly, while other researchers have pursued a “token-free” modeling paradigm built on convolutional downsampling. Compression techniques have also been applied successfully elsewhere, from learned tokenizers in audio compression to repurposing GZip’s modeling components for a variety of AI tasks. There has also been research into static Huffman coding combined with n-gram models, a methodology that favors simplicity over maximal compression efficiency.
A novel method introduced by researchers from Google DeepMind and Anthropic trains LLMs on neurally compressed text using what they call ‘Equal-Info Windows’. The approach achieves much higher compression rates without sacrificing the learnability or performance of the LLM. It uses a two-model system: a smaller language model, M1, compresses text with Arithmetic Coding, and a larger model, M2, is trained on the compressed output of M1. The procedure divides the text into windows, each of which compresses to the same fixed number of bits; that bitstream is then chunked into tokens that serve as M2’s training data. The C4 dataset, also known as the ‘Cleaned Common Crawl Corpus’, is used to train the models.
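To make the windowing step concrete, below is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not the paper’s implementation: the character-unigram “model” stands in for M1 and its arithmetic coder, the 128-bit window and 16-bit token sizes are assumed values, and the hashed “bitstream” exists only to show how a fixed-bit-length window maps onto a fixed number of M2 tokens.

```python
# Minimal sketch of the Equal-Info Windows idea (illustrative only).
# Assumptions: a toy character unigram model replaces M1's arithmetic coder,
# WINDOW_BITS / TOKEN_BITS are made-up constants, and the "compressed
# bitstream" is faked with a hash.

import math
from collections import Counter
from typing import Dict, List

WINDOW_BITS = 128   # fixed bit budget per window (assumed value)
TOKEN_BITS = 16     # bits per M2 token, i.e. a 65,536-entry vocabulary (assumed)


def build_char_model(corpus: str) -> Dict[str, float]:
    """Toy stand-in for M1: per-character code lengths -log2 p(c)."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {c: -math.log2(n / total) for c, n in counts.items()}


def split_equal_info_windows(text: str, code_len: Dict[str, float]) -> List[str]:
    """Greedily cut text into windows whose estimated compressed size is ~WINDOW_BITS."""
    windows, current, bits = [], [], 0.0
    for ch in text:
        cost = code_len.get(ch, 16.0)      # unseen chars get a pessimistic cost
        if bits + cost > WINDOW_BITS and current:
            windows.append("".join(current))
            current, bits = [], 0.0        # compressor state "resets" at the boundary
        current.append(ch)
        bits += cost
    if current:
        windows.append("".join(current))
    return windows


def window_to_tokens(window: str) -> List[int]:
    """Stand-in for 'compress the window, then chunk the bits into TOKEN_BITS tokens'.

    The hash merely fakes a compressed bitstream; the point is the mapping from
    one fixed-size window to a fixed number of token ids for M2.
    """
    n_tokens = WINDOW_BITS // TOKEN_BITS
    return [hash((window, i)) % (1 << TOKEN_BITS) for i in range(n_tokens)]


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 50
    model = build_char_model(sample)
    windows = split_equal_info_windows(sample, model)
    print(f"{len(windows)} windows, ~{WINDOW_BITS} bits each")
    print("first window:", repr(windows[0]))
    print("M2 token ids:", window_to_tokens(windows[0]))
```

The property the sketch preserves is the one that matters for training: every window costs the same number of compressed bits, so M2 always sees the same number of tokens per window, regardless of how much raw text each window happens to cover.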
‘Equal-Info Windows’ maintains performance and efficiency across large datasets by holding the compression rate constant, which gives the LLM the stable inputs it needs. The reported results suggest the method substantially outstrips traditional setups: across various tests it reduced perplexity by up to 30%, and processing speed increased by up to 40% compared to the standard configuration.
In summary, ‘Equal-Info Windows’ is a groundbreaking method for training large language models on compressed text, improving their scalability and performance. The technique delivers consistent compression rates, and its successful application to the C4 dataset marks an important step forward in modeling methodology. This research opens new pathways for investigations into data compression and efficient model training.