Measuring the performance of large language models (LLMs) is a crucial part of the pre-training and fine-tuning stages that precede deployment. Frequent, rapid validation during these stages increases the likelihood of actually improving the model. In partnership with Gradient, a service that develops personalized LLMs, the challenge of evaluating these models was tackled. With lm-evaluation-harness, the mainstream tool for LLM evaluation, VRAM limitations and GPU instance availability proved to be the main stumbling blocks.
The solution integrated lm-evaluation-harness with AWS Neuron, the SDK behind AWS Inferentia and Trainium, allowing an early version of Gradient’s model (v-alpha-tross) to be benchmarked against other models both during and after training. The integration abstracts the estimation of sequence log likelihoods and the inference of tokens away from the evaluation tasks themselves, so the tasks are unaffected. This allowed the internal testing pipeline to shift to Amazon EC2 Inf2 instances, gaining access to 384 GB of shared accelerator memory and enabling frequent testing across multiple instances.
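To illustrate what that abstraction looks like, the skeleton below shows the interface a Neuron-backed model would plug into in lm-evaluation-harness. It is a sketch only; the class name and registry key are hypothetical and this is not the code shipped in the integration.

```python
# Illustrative skeleton only: the harness drives evaluation through an LM
# interface, so a Neuron-backed model only has to supply log likelihoods and
# generated tokens. Class name and registry key below are hypothetical.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("neuron-sketch")  # hypothetical model type, for illustration
class NeuronSketchLM(LM):
    def loglikelihood(self, requests):
        # For each (context, continuation) pair, return
        # (log p(continuation | context), is_greedy) computed on the Neuron device.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Rolling log likelihood over a whole sequence, used by perplexity-style tasks.
        raise NotImplementedError

    def generate_until(self, requests):
        # Free-form generation up to stop sequences, used by generative tasks.
        raise NotImplementedError
```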
The resulting benchmark setup, whose goal was to reproduce the scores of the Open LLM Leaderboard while retaining the flexibility to run private benchmarks, required only small code changes: the model is ported from Hugging Face transformers to a drop-in replacement, without requiring a precompiled model. Although minor score variation between runs was expected, the standard deviation across runs turned out to be close to zero, and the scores were consistent with those published on the Hugging Face leaderboard.
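A hedged sketch of what such a drop-in replacement can look like is shown below, using optimum-neuron’s NeuronModelForCausalLM in place of transformers’ AutoModelForCausalLM. The checkpoint and compilation parameters are illustrative choices, not the exact configuration from the post.

```python
# Sketch, not the post's exact code: optimum-neuron exposes a causal LM class
# that mirrors the transformers API and compiles the checkpoint for Inferentia2
# at load time (export=True), so no precompiled model is required.
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # example checkpoint from the post
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,           # compile for Neuron when the model is loaded
    batch_size=1,          # illustrative static shapes required for compilation
    sequence_length=2048,
    num_cores=2,           # both NeuronCores on an inf2.xlarge
    auto_cast_type="bf16",
)

inputs = tokenizer("Evaluation on Inferentia2 ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```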
Testing with the new harness required requesting a service quota increase for On-Demand Inf instances in the chosen Region. The instance type was then selected according to model size: v-alpha-tross ran on an inf2.48xlarge instance and mistralai/Mistral-7B-v0.1 on an inf2.xlarge instance. After the model was deployed, the next step was to clone and install lm-evaluation-harness on the instance and run lm_eval with the hf-neuron model type.
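The snippet below is a minimal sketch of that last step using the harness’s Python entry point rather than the lm_eval command line. It assumes the Neuron backend is registered under the model type named in the post; verify the exact name and the chosen task against your installed harness version.

```python
# Minimal sketch using lm-evaluation-harness's Python API (equivalent to the
# lm_eval CLI). The model type string is the one named in the post; confirm it
# against the backends registered in your installed harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-neuron",                                   # Neuron-backed model type from the post
    model_args="pretrained=mistralai/Mistral-7B-v0.1",   # example checkpoint
    tasks=["hellaswag"],                                 # one Open LLM Leaderboard task, as an example
)
print(results["results"])
```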
The post closes with a reminder to stop EC2 instances once testing is done in order to avoid unnecessary costs. The Gradient and AWS Neuron teams encourage broader adoption of LLM evaluation on AWS Inferentia2 and are excited about its growing utility.
Among the authors of the post are Michael Feil, an AI engineer at Gradient, and Jim Burtoft, a Senior Startup Solutions Architect at AWS.