
BRIA AI used distributed training on Amazon SageMaker to train latent diffusion foundation models for commercial use.

BRIA AI 2.0 is a high-resolution (1024×1024) text-to-image diffusion model trained by BRIA AI on a dataset of licensed images. Training was fast and cost-effective thanks to Amazon SageMaker, a platform that offers tools and workflows to build, train, and deploy machine learning models. BRIA AI specializes in generative artificial intelligence (AI) for developers, and its models are trained exclusively on licensed data obtained from partners.

The company faced four key challenges in large-scale model training: maintaining operational excellence, reducing time-to-train through data parallelism, maximizing GPU utilization through efficient data loading, and reducing training cost. BRIA AI relied on SageMaker's data distribution, the AllReduce collective operation, local file access, and automatic retry of failed jobs to overcome these hurdles.

Data pre-processing begins with contributors uploading raw image files to BRIA AI's Amazon Simple Storage Service (Amazon S3) bucket. Through Amazon Simple Queue Service (Amazon SQS) and AWS Lambda, these images are processed and packed into large WebDataset tar files, which can then be streamed directly from the S3 bucket during model training.
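To make the packing step concrete, here is a minimal sketch of how images and their metadata can be written into a WebDataset-style tar shard using only the Python standard library. The function name, sample keys, and file contents are hypothetical; WebDataset simply groups files that share a basename (e.g. `img-000001.jpg` and `img-000001.json`) into one training sample.

```python
import io
import json
import tarfile

def pack_webdataset_shard(samples, shard_path):
    """Pack (key, image_bytes, metadata) tuples into one WebDataset-style
    tar shard. Files sharing a basename form a single training sample."""
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, meta in samples:
            for suffix, payload in (
                (".jpg", image_bytes),                       # the image itself
                (".json", json.dumps(meta).encode("utf-8")),  # its metadata
            ):
                info = tarfile.TarInfo(name=key + suffix)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Hypothetical usage with two tiny placeholder "images"
samples = [
    ("img-000001", b"\xff\xd8fake-jpeg-1", {"caption": "a red square"}),
    ("img-000002", b"\xff\xd8fake-jpeg-2", {"caption": "a blue circle"}),
]
pack_webdataset_shard(samples, "shard-000000.tar")
```

In production such a function would run inside a Lambda handler triggered by SQS messages, writing finished shards back to S3; that wiring is omitted here.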

The model training procedure distributes training jobs and streams data from Amazon S3 to the training instances through SageMaker's FastFile mode. To ensure cost-effectiveness, BRIA AI centered its training strategy around three resolution stages for optimal model convergence. When training had to be paused for adjustments or troubleshooting, SageMaker helped minimize cost because BRIA AI paid only for active training time.
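The input channel for such a job can be expressed through the SageMaker `CreateTrainingJob` API. The sketch below shows an `InputDataConfig` entry (as passed to `boto3.client("sagemaker").create_training_job(...)`) that selects FastFile mode, which streams S3 objects on demand instead of copying the whole dataset to the instance before training starts. The bucket and prefix are hypothetical.

```python
# InputDataConfig fragment for a CreateTrainingJob request (boto3 SageMaker
# API). Bucket/prefix names below are placeholders, not BRIA AI's actual paths.
input_data_config = [
    {
        "ChannelName": "training",
        # FastFile mode streams objects lazily from S3 as the data loader
        # reads them, so training can start without a full download.
        "InputMode": "FastFile",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-bucket/webdataset-shards/",
                # ShardedByS3Key splits the shards across training instances,
                # which is how SageMaker distributes data for data parallelism.
                "S3DataDistributionType": "ShardedByS3Key",
            }
        },
    }
]
```

With `ShardedByS3Key`, each instance sees a disjoint subset of the WebDataset shards, so no two workers re-read the same data.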

A significant decrease in time-to-train was observed when BRIA AI scaled out training in the cloud while continuing to use Hugging Face Accelerate, a library that enables the same PyTorch code to run across a distributed configuration. BRIA AI used the FSDP ShardingStrategy.SHARD_GRAD_OP option to increase batch size and expedite the training process.

BRIA AI's training runs achieved an average GPU utilization of over 98%, indicating near-maximal use of the GPUs throughout the training cycle. The generated images demonstrated the capabilities of the SageMaker-trained BRIA AI 2.0 model.

Amazon SageMaker enabled BRIA AI to train a diffusion model efficiently, without manually provisioning and configuring infrastructure. Its ease of use, automation, and cost-effectiveness make SageMaker an attractive option for large-scale AI model training.
