
Innovating large language model training with Arcee and AWS Trainium

Arcee, an artificial intelligence (AI) company, has made strides in optimizing the training of large language models (LLMs) using continual pre-training (CPT) and model merging. These advances matter most in specialized domains such as medicine, law, and finance. The work was accelerated by Arcee's use of AWS Trainium, AWS's purpose-built machine learning accelerator, which offers affordable access to high-performance training instances.

Continual pre-training (CPT) extends the training of a base model such as Llama 2 on domain-specific datasets. Arcee applied this method on Trainium chips, training a Llama 2 model on a PubMed dataset of 88 billion tokens, which significantly improved the model's performance.

Trainium matters for pre-training now that more developers are building applications on generative AI and LLMs. The cost and limited availability of GPUs have slowed innovation in model building. Trainium gives developers access to purpose-built training accelerators that can reduce training costs by up to 50% compared with comparable Amazon Elastic Compute Cloud (Amazon EC2) instances.

Arcee used AWS ParallelCluster to create a high-performance computing (HPC) environment for its distributed training job. The cluster had 16 nodes, each a trn1n.32xlarge instance with 32 GB of accelerator memory per Trainium chip. The AWS Neuron SDK handled distributed training and supports ML frameworks such as PyTorch and TensorFlow.
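To make the scale of that cluster concrete, the arithmetic can be sketched as follows. Only the node count (16) and the 32 GB figure come from the article. The per-instance chip count (16) and the cores per chip (2) are assumptions based on published Trn1 specifications, and `ClusterShape` is a hypothetical helper, not part of ParallelCluster or the Neuron SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    """Hypothetical helper describing a homogeneous accelerator cluster."""
    nodes: int
    chips_per_node: int      # assumption: 16 Trainium chips per trn1n.32xlarge
    cores_per_chip: int      # assumption: 2 NeuronCores per Trainium chip
    memory_gb_per_chip: int  # 32 GB, the figure quoted in the article

    @property
    def total_chips(self) -> int:
        return self.nodes * self.chips_per_node

    @property
    def world_size(self) -> int:
        # One training worker per NeuronCore is a common layout, assumed here.
        return self.total_chips * self.cores_per_chip

    @property
    def total_memory_gb(self) -> int:
        return self.total_chips * self.memory_gb_per_chip


cluster = ClusterShape(nodes=16, chips_per_node=16,
                       cores_per_chip=2, memory_gb_per_chip=32)
print(cluster.total_chips, cluster.world_size, cluster.total_memory_gb)
```

Under these assumptions the job would span 256 chips and 512 workers; the article itself does not state the worker layout Arcee used.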

The study tracked perplexity on a held-out PubMed dataset across checkpoints saved during training and found that the model improved consistently over time. Continual pre-training on raw domain-specific data improved on the starting Llama 2 7B checkpoint, lowering the model's perplexity on the PMC test set.
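For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood a model assigns to held-out tokens, so lower is better. The sketch below shows the standard definition. The `perplexity` function and the per-token probabilities are illustrative assumptions, not Arcee's evaluation code or results.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean log-probability) over a list of natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities that two checkpoints assign to the same
# three held-out tokens; these numbers are made up for illustration.
early_checkpoint = [math.log(p) for p in (0.10, 0.20, 0.05)]
later_checkpoint = [math.log(p) for p in (0.25, 0.40, 0.20)]

print(perplexity(early_checkpoint) > perplexity(later_checkpoint))
```

A checkpoint that assigns higher probability to the held-out text yields a lower perplexity, which is the trend the study reports as training progressed.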

In conclusion, the combination of Trainium's capabilities and Arcee's model training strategies has the potential to change how LLMs are developed, making domain-specific training more accessible, less expensive, and easier to adapt to the needs of different industries. Trainium's scalability, efficiency, and security features open new avenues for innovation and practical applications in AI-driven industries.
