Researchers from the University of Toronto and the Vector Institute have developed an advanced framework for protein language models (PLMs), called Protein Annotation-Improved Representations (PAIR). This framework enhances the ability of models to predict amino acid sequences and generate feature vectors representing proteins, proving particularly useful in predicting protein folding and mutation effects.
PLMs are traditionally trained with pseudo-likelihood objectives on large protein sequence databases. They have been successful because they detect conserved sequence motifs, which are often critical for protein fitness. However, the relationship between sequence conservation and fitness is complex, shaped by various evolutionary and environmental factors.
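For readers unfamiliar with the term, a pseudo-likelihood objective is commonly written as below. The notation is an assumption made here for illustration (a sequence x of length L, model parameters θ, and x with position i hidden written as x_{\setminus i}); it is not taken from the PAIR paper.

```latex
\mathcal{L}_{\mathrm{PL}}(\theta) = -\sum_{i=1}^{L} \log p_\theta\left(x_i \mid x_{\setminus i}\right)
```

Each term asks the model to predict one amino acid from the rest of the sequence. Masked language modelling is a common practical approximation, hiding a random subset of positions at each training step instead of every position in turn.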
To address this, the researchers integrated text annotations, a wealth of additional data detailing protein functions and structures, into their new model. By training the PLMs with these expertly curated annotations from UniProt, they significantly improved the accuracy of the models.
With a text decoder as part of the PAIR framework, the improvements were most pronounced in function prediction tasks, where the model outperformed established algorithms such as BLAST, especially for proteins with low sequence similarity to the training data.
Traditionally, protein labelling relied on techniques such as BLAST and Hidden Markov Models (HMMs), which make use of sequence alignment and additional data like protein family and evolutionary information to detect protein sequence homology. These classical approaches, while effective for sequences with high similarity, struggled with remote homology detection.
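The alignment-based labelling described above can be illustrated with a minimal sketch. This is not BLAST or an HMM: it is a plain Needleman-Wunsch global alignment score used as a stand-in, and the sequences, labels, scoring values, and threshold are all hypothetical. It shows why a close homolog inherits a label while a remote one falls below the score cutoff.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev holds the previous row of the dynamic-programming table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


def transfer_annotation(query, database, min_score):
    """Return (label, score) of the best hit; label is None below min_score."""
    best_label, best_score = None, None
    for seq, label in database:
        score = global_alignment_score(query, seq)
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    if best_score is None or best_score < min_score:
        return None, best_score
    return best_label, best_score


# Hypothetical toy database: sequences and labels are made up for illustration.
database = [
    ("MKTAYIAKQR", "kinase-like"),
    ("GSHMLEDPVA", "protease-like"),
]

close_query = "MKTAYIAKQK"   # one substitution away from the first entry
remote_query = "MRSWYLGKNR"  # shares only a few residues with any entry

if __name__ == "__main__":
    print(transfer_annotation(close_query, database, min_score=5))
    print(transfer_annotation(remote_query, database, min_score=5))
```

The remote query returns no label because its best alignment score is under the cutoff, which is the failure mode that learned protein representations aim to address.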
In response to this limitation, PLMs have been developed, using deep learning techniques to learn protein representations from large-scale sequence data. The PAIR framework builds on these developments by integrating additional data modalities: it encodes a protein’s sequence into a continuous representation and uses a SciBERT-based text decoder to generate text annotations from that representation.
The PAIR framework was tested on nineteen different types of data, with the results indicating that the model could potentially be used to represent other biological entities, including small molecules and nucleic acids.
This new development in the field of bioinformatics underscores the value of integrating diverse data sources to improve protein function prediction. The improvement in performance for sequences with low similarity to training data is particularly important, as this has been a persistent problem area for traditional methods. The success of the PAIR framework in handling these limited data scenarios therefore represents a significant step forward for protein function prediction.