Distributed deep learning for large language models (LLMs) continues to advance rapidly, particularly since the release of ChatGPT in 2022. These models keep growing to billions or even trillions of parameters, which often cannot fit within a single accelerator or node due to memory constraints. As a result, customers must distribute their workloads across hundreds or thousands of GPUs.
To address these challenges, the distributed training community has introduced 3D parallelism and other techniques. In 2023, Amazon announced the release of the SageMaker Model Parallel Library 2.0 (SMP), which integrates with the open-source PyTorch Fully Sharded Data Parallel (FSDP) APIs, enabling the training of large models and, for the first time, adding tensor parallelism on top of FSDP.
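Because SMP 2.0 builds directly on the open-source FSDP APIs, a standard FSDP training script is the natural starting point. The minimal sketch below wraps a stand-in model with full sharding using only open-source PyTorch; it illustrates the underlying API rather than the post's actual training script, and the SMP-specific initialization steps are omitted.

```python
# Baseline: open-source PyTorch FSDP with full sharding (the API SMP 2.0 integrates with).
# Launch with torchrun; the model here is a small stand-in, not Llama 2.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Transformer(d_model=512, nhead=8).cuda()  # placeholder for an LLM
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, and optimizer state
    device_id=torch.cuda.current_device(),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```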
This post discusses the performance benefits of Amazon SageMaker, demonstrated through benchmarks of Llama 2 training on clusters of up to 128 instances. It shows near-linear scaling efficiency and quantifies the contribution of each feature to optimal throughput.
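To make the scaling claim concrete, scaling efficiency is the measured speedup divided by the ideal linear speedup. The short sketch below computes it from per-cluster throughput; the numbers are hypothetical placeholders, not the benchmark results reported in the post.

```python
# Illustrative calculation of scaling efficiency from measured throughput.
# The throughput figures below are hypothetical placeholders, not the post's results.
def scaling_efficiency(throughput: dict[int, float]) -> dict[int, float]:
    """Efficiency relative to the smallest cluster, assuming ideal linear scaling."""
    base_nodes = min(throughput)
    base_tps = throughput[base_nodes]
    return {
        nodes: (tps / base_tps) / (nodes / base_nodes)
        for nodes, tps in throughput.items()
    }

# tokens/sec per cluster size (placeholder values)
measured = {8: 1.00e6, 16: 1.97e6, 32: 3.90e6, 64: 7.7e6, 128: 15.2e6}
for nodes, eff in scaling_efficiency(measured).items():
    print(f"{nodes:4d} instances: {eff:.1%} scaling efficiency")
```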
The post further illustrates performance in SageMaker using a fixed Llama 2 model size of 70B parameters. The latest releases of SMP and the SageMaker distributed data parallelism library (SMDDP) support multiple features, including native PyTorch FSDP, hybrid sharding, Transformer Engine integration, tensor parallelism, and an optimized AllGather collective operation.
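Of these features, hybrid sharding is straightforward to illustrate with the open-source FSDP API: parameters are fully sharded within a replication group and replicated across groups, which keeps the heaviest collectives on faster intra-node links. The sketch below uses PyTorch's own HYBRID_SHARD strategy as a stand-in for the SMP feature; the two-dimensional device mesh and the 8-GPU group size are illustrative assumptions and require a recent PyTorch release.

```python
# Hybrid sharding with open-source PyTorch FSDP (stand-in for SMP's hybrid sharding).
# Parameters are sharded inside each 8-GPU group (assumed to be one node) and
# replicated across groups, reducing cross-node AllGather traffic.
# The device_mesh argument requires a recent PyTorch release (2.2+).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

gpus_per_node = 8
mesh = init_device_mesh("cuda", (dist.get_world_size() // gpus_per_node, gpus_per_node))

model = nn.Transformer(d_model=512, nhead=8).cuda()  # placeholder for an LLM
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard within group, replicate across groups
    device_mesh=mesh,  # (replication, sharding) dimensions
    device_id=torch.cuda.current_device(),
)
```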
Additional topics covered in the post include SMDDP's improvement over NCCL with FSDP full sharding, the replacement of FSDP full sharding with hybrid sharding, the throughput boost from the Transformer Engine, and training with long sequences using SMP tensor parallelism.
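Swapping NCCL for SMDDP's optimized collectives amounts to selecting a different process-group backend; the rest of the FSDP script is unchanged. The sketch below follows the pattern in the SageMaker documentation, assuming the smdistributed.dataparallel package that ships in SageMaker training containers; outside of SageMaker it falls back to NCCL.

```python
# Selecting the SMDDP collectives instead of NCCL (only available inside
# SageMaker training containers; package and backend name per AWS documentation).
import torch.distributed as dist

try:
    import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend
    backend = "smddp"
except ImportError:
    backend = "nccl"  # fall back to NCCL outside SageMaker

dist.init_process_group(backend=backend)
# ... wrap the model with FSDP exactly as before; the sharded AllGather and
# ReduceScatter calls now go through the SMDDP-optimized collectives.
```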
The post concludes by demonstrating efficient LLM training with SMP and SMDDP on P4d instances. SageMaker continues to be a vital tool for LLM researchers and practitioners, and the team encourages readers to contact them for more information.