Designing deep learning architectures is resource-intensive: the design space is vast, prototyping cycles are long, and training and evaluating models at scale carries high computational cost. Improvements have traditionally come from heuristic, experience-driven development rather than systematic procedures, a problem compounded by the combinatorial explosion of possible designs and the lack of reliable prototyping pipelines.
Despite this, most state-of-the-art models still rely on standard Transformer stacks that alternate between memory-based and memoryless mixers, a recipe that is effective at in-context and factual recall. To speed up prototyping, artificial intelligence researchers from various universities and institutes proposed an approach called mechanistic architecture design (MAD): a battery of small-scale synthetic tests that isolate critical architectural capabilities and require minimal training time.
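As a rough illustration of what a MAD-style synthetic test might look like, the sketch below generates a toy in-context recall task: sequences of key–value token pairs followed by a query key, where the target is the value paired with that key. The generator, its parameters, and the task layout are illustrative assumptions, not the authors' exact task specification.

```python
import torch

def make_in_context_recall_batch(batch_size=32, num_pairs=16, vocab_size=64, seed=0):
    """Toy in-context recall batch: each sequence is k1 v1 k2 v2 ... kN vN q,
    where q repeats one of the keys and the target is the value paired with it."""
    g = torch.Generator().manual_seed(seed)
    # Sample keys without replacement per sequence so the query is unambiguous
    keys = torch.stack([torch.randperm(vocab_size, generator=g)[:num_pairs]
                        for _ in range(batch_size)])
    # Values live in a disjoint token range [vocab_size, 2 * vocab_size)
    values = torch.randint(vocab_size, 2 * vocab_size, (batch_size, num_pairs), generator=g)
    seq = torch.stack([keys, values], dim=-1).reshape(batch_size, 2 * num_pairs)
    query_idx = torch.randint(0, num_pairs, (batch_size,), generator=g)
    rows = torch.arange(batch_size)
    inputs = torch.cat([seq, keys[rows, query_idx].unsqueeze(-1)], dim=-1)
    targets = values[rows, query_idx]
    return inputs, targets  # inputs: (batch, 2 * num_pairs + 1), targets: (batch,)
```

A candidate primitive can then be judged by how quickly a small model built from it learns to solve batches like this one.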
The researchers used MAD to evaluate both established and novel computational primitives, including gated convolutions, gated input-varying linear recurrences, and mixtures of experts (MoEs). MAD served as a filter for candidate architectures and led to new design optimizations such as ‘striping’: building hybrid architectures by sequentially interleaving blocks made of different computational primitives.
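The striping idea can be sketched in a few lines of PyTorch: define one block per primitive and interleave them according to a fixed pattern. The mixer implementations, pattern, and hyperparameters below are illustrative assumptions rather than the paper's architectures; a real hybrid would also wrap each block in residual connections and normalization.

```python
import torch
import torch.nn as nn

class GatedConvMixer(nn.Module):
    """Toy gated short-convolution sequence mixer (illustrative only)."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal trim
        return self.out_proj(torch.sigmoid(gate) * u)

class AttentionMixer(nn.Module):
    """Standard causal multi-head self-attention sequence mixer."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        mask = torch.full((x.size(1), x.size(1)), float("-inf"), device=x.device).triu(1)
        return self.attn(x, x, x, attn_mask=mask)[0]

def striped_stack(dim=256, pattern=("conv", "attn"), repeats=6):
    """'Stripe' different sequence mixers by interleaving them in a fixed
    pattern, each followed by an MLP channel mixer, to form a hybrid stack."""
    mixers = {"conv": GatedConvMixer, "attn": AttentionMixer}
    layers = []
    for _ in range(repeats):
        for name in pattern:
            layers.append(mixers[name](dim))
            layers.append(nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                        nn.Linear(4 * dim, dim)))
    return nn.Sequential(*layers)
```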
To test how well MAD synthetics predict real-world scaling, the team trained 500 language models spanning diverse architectures and parameter counts. Hybrid designs scaled better than non-hybrid models and were more robust in long pretraining runs outside the compute-optimal frontier. The results also tied recall capability, inference efficiency, and memory cost to an architecture's state size, consistent with the MAD results.
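The memory trade-off behind this link is easy to see with rough arithmetic: an attention layer's key-value cache grows linearly with context length, while a fixed-state recurrent mixer keeps a constant-size state. The sketch below compares the two; all configuration numbers are made up for illustration and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem=2):
    """Attention: cached keys and values grow linearly with context length."""
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

def recurrent_state_bytes(num_layers, state_dim, model_dim, bytes_per_elem=2):
    """Fixed-state recurrence: memory is constant in context length."""
    return num_layers * state_dim * model_dim * bytes_per_elem

# Hypothetical configuration, chosen only to show the gap in scale
print(kv_cache_bytes(24, 16, 128, 8192) / 1e6, "MB of attention cache at 8k context")
print(recurrent_state_bytes(24, 16, 2048) / 1e6, "MB of recurrent state at any context")
```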
Further, the team proposed a state-optimal scaling methodology to estimate how perplexity scales with the state dimension of different model designs. Using MAD, they created new hybrid architectures that strategically balance perplexity, state size, and compute requirements, achieving 20% lower perplexity at the same compute budget as the strongest transformer, convolutional, and recurrent baselines.
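To illustrate what estimating such a state-dimension trend might look like in practice, here is a minimal sketch that fits an offset power law to hypothetical (state size, perplexity) pairs. The numbers, the functional form, and the fitting procedure are all illustrative assumptions, not the paper's methodology or data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (state_size, perplexity) measurements for one architecture class;
# these values are invented purely to demonstrate the fitting step.
state_sizes = np.array([1e3, 4e3, 1.6e4, 6.4e4, 2.56e5])
perplexities = np.array([14.1, 12.3, 11.0, 10.2, 9.7])

def state_scaling_law(s, a, b, c):
    # Perplexity modeled as an offset power law in total state dimension s
    return a * s ** (-b) + c

params, _ = curve_fit(state_scaling_law, state_sizes, perplexities, p0=(50.0, 0.3, 8.0))
a, b, c = params
print(f"fitted trend: ppl ~ {a:.1f} * state^(-{b:.2f}) + {c:.1f}")
```

Comparing fitted curves like this across architecture classes is one way to identify which design gets the most quality per unit of state.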
The researchers contend that this methodology could enable more efficient architecture design, particularly when comparing models within the same architectural class. The findings are notable for machine learning and artificial intelligence more broadly: a well-chosen set of synthetic MAD tasks can accurately predict scaling-law performance, a stepping stone toward faster, automated architecture design.