Transformer models have ushered in a new era of Natural Language Processing (NLP), but their high memory and computational costs often pose significant challenges. This has fueled the search for more efficient alternatives that maintain the same performance standards while requiring fewer resources. Although Linear Transformers, the RWKV model, RetNet, and others have been explored – and all offer competitive efficiency – none provides a complete solution.
This gap prompted researchers at the Toyota Research Institute to develop the Scalable UPtraining for Recurrent Attention (SUPRA) method. SUPRA converts pre-trained transformers into recurrent neural networks (RNNs), which are intrinsically cheaper to run because each new token updates a fixed-size state rather than attending over an ever-growing context. The approach achieves strong performance while greatly reducing computing costs.
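To illustrate why the recurrent form is cheaper, here is a minimal sketch (not the authors' code) of generic linear attention run as an RNN: with a kernel feature map standing in for softmax, each new token updates a constant-size state, so per-token cost stays flat instead of growing with sequence length. The feature map and dimensions below are illustrative assumptions, and the denominator shown is the standard linear-attention normalisation rather than SUPRA's.

```python
import torch

def phi(x):
    # Hypothetical feature map; SUPRA's actual choice may differ.
    return torch.nn.functional.elu(x) + 1

def recurrent_linear_attention(q, k, v):
    """q, k, v: (seq_len, d). Computes outputs token-by-token with a constant-size state."""
    seq_len, d = q.shape
    S = torch.zeros(d, d)   # running sum of phi(k_j) v_j^T
    z = torch.zeros(d)      # running sum of phi(k_j)
    outputs = []
    for t in range(seq_len):
        S = S + torch.outer(phi(k[t]), v[t])
        z = z + phi(k[t])
        num = phi(q[t]) @ S              # (d,)
        den = phi(q[t]) @ z + 1e-6       # scalar normaliser
        outputs.append(num / den)
    return torch.stack(outputs)

q = torch.randn(8, 16); k = torch.randn(8, 16); v = torch.randn(8, 16)
print(recurrent_linear_attention(q, k, v).shape)  # torch.Size([8, 16])
```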
SUPRA operates by replacing the softmax normalisation traditionally used in transformer attention with GroupNorm, a normalisation layer, and by adding a small multi-layer perceptron (MLP) to project queries and keys. The models were then uptrained on RefinedWeb, a large dataset comprising 1.2 trillion tokens. This enables transformers to operate in a recurrent and efficient manner, handling both short- and long-context tasks.
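The sketch below is a rough, hypothetical rendering of this modification, not the paper's implementation: a single linear-attention head whose queries and keys pass through a small MLP and whose output is normalised with GroupNorm instead of a softmax denominator. The feature map, MLP size, and the choice to share one MLP for queries and keys are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class LinearAttentionHead(nn.Module):
    def __init__(self, head_dim: int, mlp_hidden: int = 64):
        super().__init__()
        # Small MLP applied to queries and keys (shared here as a simplifying assumption).
        self.qk_mlp = nn.Sequential(
            nn.Linear(head_dim, mlp_hidden), nn.GELU(), nn.Linear(mlp_hidden, head_dim)
        )
        # GroupNorm over the head dimension replaces softmax normalisation.
        self.norm = nn.GroupNorm(num_groups=1, num_channels=head_dim)

    def forward(self, q, k, v):
        # q, k, v: (seq_len, head_dim); causal linear attention in parallel form.
        q = torch.nn.functional.elu(self.qk_mlp(q)) + 1   # feature map is an assumption
        k = torch.nn.functional.elu(self.qk_mlp(k)) + 1
        scores = q @ k.T                                  # (seq_len, seq_len)
        causal = torch.tril(torch.ones_like(scores))      # causal mask
        out = (scores * causal) @ v                       # unnormalised attention output
        return self.norm(out)                             # normalise instead of dividing by a softmax sum

head = LinearAttentionHead(head_dim=32)
x = torch.randn(10, 32)
print(head(x, x, x).shape)  # torch.Size([10, 32])
```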
SUPRA exhibited impressive performance on various benchmarks. On HellaSwag, it scored 77.9, outperforming RWKV and RetNet, which scored 70.9 and 73.0 respectively. SUPRA also posted strong results elsewhere: 76.3 on ARC-E, 79.1 on ARC-C, and 46.3 on MMLU. Notably, these results were achieved with only 20 billion uptraining tokens, significantly fewer than comparable recurrent models require.
It is worth noting that SUPRA did experience performance drops on long-context tasks, though it maintained robust results within its training context length. Despite this limitation, its approach of converting pre-trained transformers into efficient RNNs turns an NLP bottleneck into a scalable, cost-effective solution and may well pave the way for cheaper natural language processing tools in the future.
The Toyota Research Institute deserves credit for taking a pivotal step forward in NLP research with SUPRA, and it remains committed to exploring new ways of addressing the high computational costs traditionally associated with transformers. This research not only points toward scalable NLP models but also promises to make advanced language processing technologies more accessible in the future.