Efficiently fine-tune the ESM-2 protein language model with Amazon SageMaker

This article describes how to use Amazon SageMaker to efficiently fine-tune a state-of-the-art protein language model (pLM) to predict protein subcellular localization. Proteins play a critical role in human body function and in drug development, often serving both as potential targets and as therapeutics themselves. Like natural-language text, protein sequences can be analyzed with large language models (LLMs) to predict properties such as 3D structure and how a protein may interact with other molecules. pLMs have grown tremendously in size over the years, with correspondingly large increases in training time and in the GPU memory needed to hold their parameters. Fine-tuning an existing pre-trained pLM for a specific task, rather than training from scratch, can save much of that time and compute.
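To make the classification setup concrete, the sketch below loads a public ESM-2 checkpoint from the Hugging Face Hub with a two-class sequence-classification head. This is a minimal illustration, not the article's exact script; the checkpoint size, label names, and example sequence are assumptions.

```python
# Minimal sketch: an ESM-2 checkpoint wrapped with a binary classification head
# for subcellular localization. Checkpoint and labels are illustrative choices.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "facebook/esm2_t12_35M_UR50D"  # one of several public ESM-2 sizes

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,                               # membrane vs. cytosol
    id2label={0: "cytosol", 1: "membrane"},
    label2id={"cytosol": 0, "membrane": 1},
)

# Protein sequences are tokenized per residue (one token per amino acid letter).
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
logits = model(**inputs).logits  # raw scores for the two localization classes
```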

The authors demonstrate fine-tuning the ESM-2 pLM to predict whether a protein lives on the outside of a cell (in the cell membrane) or inside it. They use Amazon SageMaker to download a public dataset and run an efficient training job. They also write a training script that combines several techniques to improve efficiency: weighted class training, gradient accumulation, gradient checkpointing, and Low-Rank Adaptation (LoRA), as sketched below.
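Here is a minimal sketch of those four techniques, assuming the Hugging Face Trainer and peft APIs; the hyperparameter values, LoRA target modules, and class weights are illustrative assumptions rather than the article's exact settings. It reuses the `model` object from the earlier sketch.

```python
# Hedged sketch of the efficiency techniques named above; values are illustrative.
import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType

# 1) Low-Rank Adaptation (LoRA): train small adapter matrices instead of all weights.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "key", "value"],  # attention projections (assumption)
)
model = get_peft_model(model, lora_config)  # `model` from the earlier sketch

# 2) Gradient checkpointing and 3) gradient accumulation via TrainingArguments.
training_args = TrainingArguments(
    output_dir="esm2-localization",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size of 32
    gradient_checkpointing=True,     # recompute activations to save GPU memory
    fp16=True,
    num_train_epochs=3,
)

# 4) Weighted class training: up-weight the rarer class in the loss.
class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = CrossEntropyLoss(
            weight=torch.tensor([0.3, 0.7], device=outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# The trainer would then be built with tokenized train/eval datasets, e.g.
# WeightedTrainer(model=model, args=training_args, train_dataset=..., eval_dataset=...)
```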

Fine-tuning with these techniques substantially reduced runtime and GPU memory use, making it feasible to train on a wide range of cost-effective instance types. All of the methods achieved high evaluation accuracy, and although combining them caused a slight drop in accuracy, the savings in runtime and GPU memory were well worth the trade-off. The article highlights how existing tools and parameter-efficient fine-tuning techniques can speed up pre-clinical development and trial design, enabling more efficient research in the life sciences.
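To illustrate the instance-type flexibility, the following is a hedged sketch of launching such a fine-tuning job with the SageMaker Hugging Face estimator. The script name, instance type, framework versions, hyperparameters, and S3 URIs are assumptions for illustration, not the article's exact configuration.

```python
# Hypothetical job launch from a SageMaker session; names and versions are illustrative.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

estimator = HuggingFace(
    entry_point="train.py",          # the fine-tuning script (assumed name)
    source_dir="scripts",
    role=role,
    instance_type="ml.g5.2xlarge",   # a single-GPU, cost-effective instance
    instance_count=1,
    transformers_version="4.28",     # Hugging Face DLC versions (assumed)
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={
        "epochs": 3,
        "gradient_accumulation_steps": 4,
        "use_lora": True,
    },
)

# Train/test data previously uploaded to S3 (URIs are placeholders).
estimator.fit({
    "train": "s3://my-bucket/esm2-localization/train",
    "test": "s3://my-bucket/esm2-localization/test",
})
```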
