Because protein sequences and natural language are both sequential, researchers have increasingly drawn parallels between them in recent years, and deep learning models in both areas have benefited. Large language models (LLMs), for example, have been highly successful at natural language processing (NLP) tasks, prompting attempts to adapt them to interpret protein sequences.
However, these efforts have been hindered because existing datasets rarely link protein sequences directly to their text descriptions, which makes effective training and evaluation difficult. Despite advances in multimodal machine learning models (MMLMs), the lack of comprehensive datasets that combine protein sequences with text still prevents these models from being fully used in protein science.
To tackle this problem, researchers from institutions including Johns Hopkins and UNSW Sydney have developed the ProteinLMDataset to improve LLMs’ understanding of protein sequences. The dataset contains 17.46 billion tokens for self-supervised pre-training and 893,000 instructions for supervised fine-tuning. The team also created ProteinLMBench, a pioneering benchmark of 944 manually audited multiple-choice questions for measuring how well LLMs comprehend proteins. Together, the dataset and benchmark aim to close the gap in integrated protein-text data, enabling LLMs to comprehend protein sequences without additional encoders and to generate accurate protein knowledge using the Enzyme Chain of Thought (ECoT) method.
A review of the relevant literature highlights the limitations of existing datasets and benchmarks for both NLP and protein sequence interpretation. For instance, benchmarks built on Chinese-English datasets need to become more comprehensive and interpretable, and should extend beyond the geographical boundaries to which they are often confined. Major protein sequence resources such as UniProtKB and RefSeq struggle to represent protein diversity fully and to annotate data accurately, because community contributions and automated systems can introduce biases and errors.
The newly created ProteinLMDataset is split into self-supervised and supervised sections. The former features large quantities of tokens from Chinese-English scientific texts, as well as protein sequence-English text pairings from PubMed and UniProtKB. The supervised fine-tuning section consists of 893,000 instructions spread across seven segments, such as enzyme functionality and disease involvement. These are predominantly sourced from UniProtKB. The evaluation benchmark, ProteinLMBench, contains 944 carefully selected multiple-choice questions related to protein properties and sequences.
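To make the benchmark format more concrete, the following is a minimal Python sketch of how a multiple-choice question might be represented, turned into a prompt, and scored for accuracy. The `MultipleChoiceItem` class, its field names, the prompt layout, and the sample question are all illustrative assumptions; they are not the actual ProteinLMBench schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical record for one multiple-choice question (not the real schema)."""
    question: str
    options: Tuple[str, ...]
    answer_index: int  # zero-based index of the correct option


def format_prompt(item: MultipleChoiceItem) -> str:
    """Render the question with lettered options (A, B, C, ...) for a model."""
    lines = [item.question]
    for i, option in enumerate(item.options):
        lines.append(f"{chr(ord('A') + i)}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items: Sequence[MultipleChoiceItem], predictions: Sequence[int]) -> float:
    """Fraction of items whose predicted option index matches the answer key."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1 for item, predicted in zip(items, predictions)
        if predicted == item.answer_index
    )
    return correct / len(items)


# Invented example question, used only to show the data flow.
sample = MultipleChoiceItem(
    question="Which amino acid is encoded by the start codon AUG?",
    options=("Glycine", "Methionine", "Tryptophan", "Lysine"),
    answer_index=1,
)
print(format_prompt(sample))
print(accuracy([sample], [1]))
```

In practice, the predicted option index would come from parsing a model's answer to the formatted prompt; that parsing step is left out here because it depends on the model and prompting setup.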
The dataset’s value was demonstrated when an InternLM2-7B model trained on it outperformed GPT-4 on protein comprehension tasks. Overall, the ProteinLMDataset and ProteinLMBench provide a robust framework for training LLMs to interpret protein sequences and bilingual texts. By including diverse sources such as Chinese-English text pairings, the dataset supports comprehension of protein characteristics across languages. Experiments showed that accuracy improves significantly when both the self-supervised and supervised datasets are used. This work could have a substantial impact on biological research and its applications.