Large Language Models (LLMs) have garnered attention recently due to their potential to enhance a range of industries. At Arcee, the focus is on improving the domain adaptation of LLMs tailored to its clients' needs. Arcee has introduced novel techniques for continual pre-training (CPT) and model merging that significantly advance LLM training efficiency, with particular impact in the medical, legal, and financial fields. A partnership built on AWS Trainium has also been vital, enabling Arcee's platform to boost model training speed, decrease costs, and maintain compliance and data integrity within a secure AWS environment.
Ensuring efficient CPT is essential for adapting models to diverse fields, from medical applications to industrial chip design. To achieve this, Arcee leverages AWS Trainium to continually pre-train a Llama 2 model on a PubMed dataset comprising 88 billion tokens. Moreover, Arcee aims to optimize model performance in real-world applications by integrating CPT with Trainium.
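To make the CPT setup concrete, here is a minimal, illustrative sketch (not Arcee's actual pipeline) of a standard preprocessing step for continual pre-training on raw domain text: the tokenized corpus is concatenated into one stream and sliced into fixed-length blocks for causal language modeling. The function name and toy data are assumptions for illustration.

```python
# Illustrative CPT preprocessing: concatenate the tokenized domain corpus
# into a single stream, then slice it into fixed-length training blocks.

def pack_into_blocks(token_ids, block_size):
    """Split a flat token stream into full blocks; drop the remainder."""
    n_full = len(token_ids) // block_size
    return [
        token_ids[i * block_size : (i + 1) * block_size]
        for i in range(n_full)
    ]

stream = list(range(10))            # stand-in for tokenized PubMed text
blocks = pack_into_blocks(stream, block_size=4)
# blocks == [[0, 1, 2, 3], [4, 5, 6, 7]]; leftover tokens 8-9 are dropped
```

An 88-billion-token corpus is packed the same way, just at scale and with a realistic block size (e.g. the model's context length).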
While the rising popularity of AI and LLMs has made access to high-performance compute instances increasingly challenging, the introduction of Trainium has provided a cost-effective solution. Developers worldwide can now efficiently build their models without making costly, long-term compute reservations by utilizing Trainium instances. These instances deliver high performance alongside the flexibility needed to improve training efficiency and lower model-building costs.
Arcee's CPT techniques and collaboration with Trainium have expedited the LLM training process, substantially reduced costs, and reinforced data integrity and security within an AWS environment. Arcee's innovative approach to CPT and model merging makes it possible to swiftly adapt LLMs to highly specialized datasets, producing effective, reliable tools for professionals across various fields.
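For readers unfamiliar with model merging, the simplest form is a weighted average of two checkpoints' parameters. This is a hedged sketch, not Arcee's actual merging method (production merges typically use more involved schemes such as SLERP or TIES), and the state dicts below are toy stand-ins rather than real weights.

```python
# Hedged sketch of the simplest merge: a per-parameter weighted average
# of two checkpoints, e.g. a general base model and a CPT domain model.

def linear_merge(state_a, state_b, alpha=0.5):
    """Return alpha * A + (1 - alpha) * B, elementwise per parameter."""
    assert state_a.keys() == state_b.keys(), "checkpoints must match"
    return {
        name: [alpha * wa + (1.0 - alpha) * wb
               for wa, wb in zip(state_a[name], state_b[name])]
        for name in state_a
    }

base    = {"layer.weight": [1.0, 2.0]}   # toy general-purpose weights
adapted = {"layer.weight": [3.0, 4.0]}   # toy CPT domain weights
merged  = linear_merge(base, adapted, alpha=0.25)
# merged == {"layer.weight": [2.5, 3.5]}
```

The `alpha` knob controls how much of each parent survives in the merged model; tuning it trades off general capability against domain specialization.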
The CPT-enhanced Llama 2 7B checkpoint led to a notable improvement (reduction) in the model's perplexity on the PMC test set, demonstrating the potential of Arcee's strategies for continual pre-training on domain-specific raw data. Additionally, model performance improved as the number of training tokens increased.
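As a brief aside on the metric: perplexity is the exponential of the average negative log-likelihood the model assigns to each held-out token, so lower is better. The sketch below shows the computation; the log-probabilities are made up for illustration, whereas a real evaluation would take them from the model's outputs on the PMC test set.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log p(token)); lower is better."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns every token probability 0.5 has perplexity ~2,
# i.e. it is as uncertain as a fair coin flip per token.
uniform_over_two = [math.log(0.5)] * 4
print(perplexity(uniform_over_two))   # ≈ 2.0
```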
The integration of Trainium’s advanced ML capabilities with Arcee’s leading-edge strategies in model training and adaptation is set to transform the landscape of LLM development, making it more accessible, cost-effective, and tailored to the changing demands of diverse industries.
The authors of the article are Mark McQuade, CEO and Co-Founder at Arcee; Shamane Siri, PhD, Head of Applied NLP Research at Arcee; and Malikeh Ehghaghi, Applied NLP Research Engineer at Arcee. Their extensive experience in NLP, domain adaptation of LLMs, and AI research informed the development of Arcee's methodologies.