The Transformer architecture has greatly advanced natural language processing (NLP); however, its computational cost and memory usage grow quickly with scale, limiting its utility for larger models. Researchers from the University of Geneva and École polytechnique fédérale de Lausanne (EPFL) address this challenge with DenseFormer, a modification to the standard transformer architecture that improves perplexity without increasing model size. DenseFormer inserts a Depth-Weighted-Average (DWA) module after each transformer block, encouraging a consistent flow of information through the network and improving data efficiency. Drawing inspiration from DenseNets, each block receives as input a learned weighted average of the outputs of all preceding blocks, which makes the resulting models markedly more compact and faster at inference.
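Concretely, the mechanism can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration of how DWA-style averaging might wrap a stack of transformer blocks; the class name, parameter names, and identity-style initialization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SimpleDenseFormer(nn.Module):
    """Minimal sketch: transformer blocks interleaved with Depth-Weighted-Average
    (DWA) modules. Names and initialization are illustrative, not the paper's code."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        depth = len(blocks)
        # Row i holds the weights DWA_i applies to
        # [embedded input X_0, output of block 1, ..., output of block i+1].
        init = torch.zeros(depth, depth + 1)
        for i in range(depth):
            init[i, i + 1] = 1.0  # weight 1 on the current block output:
                                  # behaves like a plain transformer at init
        self.dwa_weights = nn.Parameter(init)

    def forward(self, x_embedded: torch.Tensor) -> torch.Tensor:
        reps = [x_embedded]            # X_0: the embedded input
        y = x_embedded
        for i, block in enumerate(self.blocks):
            reps.append(block(y))      # X_{i+1}: output of block i+1
            # DWA: weighted average over every representation seen so far.
            stacked = torch.stack(reps, dim=0)              # (i+2, B, T, D)
            w = self.dwa_weights[i, : i + 2]                # (i+2,)
            y = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # weighted sum
        return y
```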
Whereas most transformer adaptations alter the internals of each block, DenseFormer operates between blocks, making it compatible with existing architectures. Careful attention to hardware efficiency keeps the added overhead negligible, and other approaches, such as mixtures of experts, can also benefit from DenseFormer's adaptability.
After each transformer block, the DWA module computes a weighted average of the current block's output, the outputs of all previous blocks, and the initial embedded input, giving later blocks direct access to earlier representations. The researchers also introduce a variant, the Dilated DenseFormer, which sparsifies the DWA weights by periodically setting them to zero; this yields substantial savings in computation without noticeable degradation in performance.
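As a rough sketch of the dilation idea, one can mask the DWA weights so that, for a given block, only every k-th earlier representation keeps a nonzero weight. The exact index pattern below (counting backwards from the current output) is an assumption for illustration and may differ from the paper's convention.

```python
import torch


def dilation_mask(depth: int, k: int) -> torch.Tensor:
    """Sketch of a dilated DWA weight mask (illustrative index convention).

    Row i masks the weights DWA_i applies to [X_0, X_1, ..., X_{i+1}].
    Only indices j with (i + 1 - j) % k == 0 are kept, i.e. the current
    block output and every k-th representation counting backwards.
    """
    mask = torch.zeros(depth, depth + 1)
    for i in range(depth):
        for j in range(i + 2):          # representations available to DWA_i
            if (i + 1 - j) % k == 0:
                mask[i, j] = 1.0
    return mask


# Example: 8 blocks, dilation 4 -> each DWA keeps roughly a quarter of its inputs.
print(dilation_mask(depth=8, k=4))
```

Multiplying the DWA weights from the earlier sketch by such a mask (and skipping the zeroed terms entirely) is what would produce the reported compute savings.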
In language modeling experiments, DenseFormer outperformed standard transformer baselines across metrics including model size, inference time, and perplexity. Experiments with DenseFormer variants, such as dilation and different DWA periods, also improved efficiency; for instance, a dilation of four combined with a DWA period of five gave the best trade-off between speed and perplexity across datasets and sequence lengths.
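To make that trade-off concrete, the small sketch below counts how many weighted terms a DWA-augmented model would evaluate under a given dilation and DWA period. It assumes the DWA period means a DWA module is placed only after every p-th block, and it uses the same dilation convention as the mask above; both are illustrative assumptions, and the 48-block depth is hypothetical.

```python
def count_dwa_terms(depth: int, dilation: int, period: int) -> int:
    """Count nonzero DWA terms for a `depth`-block model (illustrative).

    Assumes a DWA module sits only after every `period`-th block, and that
    each DWA keeps only every `dilation`-th earlier representation,
    counting back from the current block output.
    """
    total = 0
    for i in range(depth):                      # block indices 0..depth-1
        if (i + 1) % period != 0:               # no DWA after this block
            continue
        available = i + 2                       # X_0 .. X_{i+1}
        total += sum(1 for j in range(available)
                     if (i + 1 - j) % dilation == 0)
    return total


# Dense DWA vs. the reported sweet spot (dilation 4, period 5), hypothetical 48 blocks:
print(count_dwa_terms(48, dilation=1, period=1))   # every term kept
print(count_dwa_terms(48, dilation=4, period=5))   # far fewer terms to compute
```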
In summary, DenseFormer modifies the standard transformer architecture to achieve better performance on NLP tasks. By using DWA modules to give each block direct access to the outputs of previous blocks, it strikes a better balance between perplexity and speed, and techniques such as dilation and DWA periodicity further improve efficiency without compromising performance. DenseFormer thus opens a promising path toward more efficient language models; future work aims to optimize its implementation, explore effective sparsity patterns, and build scalable, distributed training methods.