Google researchers have been investigating how large Transformer models can be served efficiently for large-scale natural language processing workloads. Although these models have revolutionised the field, they demand careful engineering and memory optimisation. The team has focused on multi-dimensional partitioning techniques that work across TPU v4 slices, combined with low-level optimisations, which has enabled them to outperform the benchmarks set by FasterTransformer in terms of latency and model FLOPs utilisation.
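As a rough illustration of what multi-dimensional partitioning looks like in practice, the sketch below shards a feed-forward weight matrix across a 2D mesh of accelerators using JAX's sharding API; the mesh shape, axis names and tensor dimensions are illustrative assumptions rather than the team's exact layout.

```python
# A minimal 2D-partitioning sketch: shard a feed-forward weight matrix over
# a 2x4 device mesh so each chip holds only a (d_model/2, d_ff/4) tile.
# The mesh shape and axis names are illustrative assumptions.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

devices = mesh_utils.create_device_mesh((2, 4))   # assumes 8 accelerators
mesh = Mesh(devices, axis_names=("x", "y"))

d_model, d_ff = 1024, 4096
w_in = jax.device_put(jnp.zeros((d_model, d_ff)),
                      NamedSharding(mesh, P("x", "y")))   # 2D weight layout

@jax.jit
def ffn_in(x, w):
    # The matmul runs on local weight tiles; XLA inserts the collectives
    # (all-gather / reduce-scatter) implied by the operands' shardings.
    return jnp.einsum("bd,df->bf", x, w)

x = jax.device_put(jnp.ones((8, d_model)), NamedSharding(mesh, P(None, "x")))
print(ffn_in(x, w_in).sharding)
```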
In addition, the researchers have scaled context lengths by a factor of 32. With the PaLM 540B model they achieved a latency of 29ms per generated token and a model FLOPs utilisation of 76%, while handling context lengths of up to 2,048 tokens, which opens up use cases ranging from interactive chatbots to high-throughput offline inference.
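For readers wanting to sanity-check the 76% figure, model FLOPs utilisation (MFU) is the model FLOPs actually delivered per second divided by the aggregate peak FLOPs of the hardware. The back-of-the-envelope sketch below uses the common approximation of roughly 2 FLOPs per parameter per generated token; the chip count, token throughput and per-chip peak are assumed for illustration and are not the paper's exact configuration.

```python
# Back-of-the-envelope MFU: model FLOPs delivered per second divided by the
# aggregate peak FLOPs of the slice, assuming ~2 FLOPs per parameter per token.

def model_flops_utilisation(n_params, tokens_per_sec, n_chips, peak_flops_per_chip):
    achieved = 2.0 * n_params * tokens_per_sec   # model FLOPs/s actually delivered
    peak = n_chips * peak_flops_per_chip         # aggregate peak FLOPs/s
    return achieved / peak

# Illustrative numbers only: a 540B-parameter model on a hypothetical 64-chip
# slice, taking ~275 TFLOP/s as the assumed bf16 peak of a TPU v4 chip.
mfu = model_flops_utilisation(n_params=540e9,
                              tokens_per_sec=13_000,   # aggregate, assumed
                              n_chips=64,
                              peak_flops_per_chip=275e12)
print(f"MFU ~= {mfu:.0%}")   # roughly 80% with these assumed numbers
```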
Earlier systems have combined tensor and pipeline parallelism with memory optimisations. FasterTransformer and DeepSpeed Inference have set the benchmarks in this area, but each has its limitations. The latter, for instance, relies on ZeRO offloading of weights to CPU and NVMe memory in order to fit very large models.
The Google team’s novel approach balances low latency with high throughput by optimising the attention mechanism and the partitioning layouts of weights and activations across chips. They have also drawn on techniques such as sparse architectures and adaptive computation, which reduce the FLOPs per token and the volume of chip-to-chip communication, and which can deliver cost and latency improvements for a wide variety of digital services.
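One concrete example of an attention-level optimisation in this vein is multiquery attention, in which all query heads share a single key/value head so the decode-time cache shrinks by a factor of the head count. The sketch below, with illustrative shapes and with causal masking and incremental caching omitted, shows the idea rather than the team's exact implementation.

```python
# Multiquery attention sketch: n_heads query projections, one shared K/V head.
# Causal masking and the incremental decode cache are omitted for brevity;
# all shapes below are illustrative.
import jax
import jax.numpy as jnp

def multiquery_attention(x, wq, wk, wv, wo):
    # x: (batch, seq, d_model); wq: (d_model, n_heads, d_head)
    # wk, wv: (d_model, d_head); wo: (n_heads, d_head, d_model)
    q = jnp.einsum("bsd,dhk->bshk", x, wq)            # per-head queries
    k = jnp.einsum("bsd,dk->bsk", x, wk)              # single shared key head
    v = jnp.einsum("bsd,dk->bsk", x, wv)              # single shared value head
    logits = jnp.einsum("bshk,btk->bhst", q, k) / jnp.sqrt(q.shape[-1])
    probs = jax.nn.softmax(logits, axis=-1)
    out = jnp.einsum("bhst,btk->bshk", probs, v)
    return jnp.einsum("bshk,hkd->bsd", out, wo)

b, s, d, h, k = 2, 16, 512, 8, 64
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (b, s, d))
wq, wk, wv, wo = (jax.random.normal(key, shape)
                  for shape in [(d, h, k), (d, k), (d, k), (h, k, d)])
print(multiquery_attention(x, wq, wk, wv, wo).shape)   # (2, 16, 512)
# The K/V cache per token is d_head values instead of n_heads * d_head,
# which is what makes long contexts and large batches cheaper to decode.
```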
Despite these measures, significant challenges to scalability remain. The communication volumes and FLOP counts involved at this scale are difficult to navigate, and mitigating them remains the researchers' core focus.