Fusion oncoproteins, chimeric proteins produced by chromosomal translocations, play a critical role in many cancers, especially pediatric ones. Because they are large and intrinsically disordered, however, they are difficult to target with traditional structure-based drug design. To tackle this challenge, researchers at Duke University have developed FusOn-pLM, a protein language model tailored specifically to fusion oncoproteins.
The new model builds on the ESM-2 protein language model (pLM), fine-tuning it with a focus on key residues related to protein behavior, and it outperforms both the base ESM-2 model and other competing models. The curated training dataset comprises 41,420 sequences from the FusionPDB database and 4,536 from the FOdb database. The sequences were clustered, and the resulting clusters were divided into training, validation, and testing sets.
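The cluster-then-split step can be illustrated with a minimal sketch. It assumes each sequence already carries a cluster ID from some upstream clustering tool, and it assigns whole clusters to one split so that similar sequences never land on both sides of a train/test boundary. The function name `split_by_cluster`, the 80/10/10 fractions, and the greedy fill order are illustrative assumptions, not details reported for FusOn-pLM.

```python
import random
from collections import defaultdict


def split_by_cluster(seq_to_cluster, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test (hypothetical helper).

    seq_to_cluster maps a sequence ID to its cluster ID. The fractions are
    assumed targets; the source does not state the real split ratios.
    """
    # Group sequence IDs by the cluster they belong to.
    clusters = defaultdict(list)
    for seq_id, cluster_id in seq_to_cluster.items():
        clusters[cluster_id].append(seq_id)

    # Shuffle clusters reproducibly so the split does not depend on ID order.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    total = len(seq_to_cluster)
    targets = [fraction * total for fraction in fractions]
    splits = {"train": [], "val": [], "test": []}
    names = list(splits)

    current = 0
    for cluster_id in cluster_ids:
        # Advance to the next split once the current one has hit its target.
        while current < len(names) - 1 and len(splits[names[current]]) >= targets[current]:
            current += 1
        # A cluster is never divided: all of its members go to one split.
        splits[names[current]].extend(clusters[cluster_id])
    return splits


if __name__ == "__main__":
    # Toy data: 20 clusters with 5 sequences each.
    toy = {f"seq{c}_{i}": c for c in range(20) for i in range(5)}
    result = split_by_cluster(toy)
    print({name: len(ids) for name, ids in result.items()})
```

Splitting at the cluster level rather than the sequence level is the design choice that matters here: it keeps near-duplicate sequences out of the held-out sets, so benchmark scores reflect generalization rather than memorization.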
FusOn-pLM was produced by fine-tuning the ESM-2-650M model with a targeted probabilistic masking strategy that focuses on amino acids typically involved in protein-protein interactions. This focus improved the model’s ability to recognize interaction sites within fusion oncoproteins. The model was then evaluated on several benchmarks, including predicting the cellular localization of fusion oncoproteins and their associations with specific cancers.
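One way to picture a targeted probabilistic masking strategy is the sketch below. It assumes each residue has a non-negative score (for example, a hypothetical interaction-propensity estimate) and masks residues with probability proportional to that score, rather than uniformly at random. The names `masking_probabilities` and `mask_sequence`, the `<mask>` token string, and the 15% default rate are assumptions made for illustration; the source does not describe the exact scoring or rates used for FusOn-pLM.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder token string, assumed for this sketch


def masking_probabilities(scores, mask_rate=0.15):
    """Turn per-residue scores into per-residue masking probabilities.

    Probabilities are proportional to the scores and scaled so the expected
    number of masked residues is about mask_rate * length (slightly less
    whenever a probability has to be capped at 1.0).
    """
    n = len(scores)
    total = sum(scores)
    if total == 0:
        # No signal at all: fall back to uniform masking.
        return [mask_rate] * n
    return [min(1.0, mask_rate * n * s / total) for s in scores]


def mask_sequence(sequence, scores, mask_rate=0.15, seed=None):
    """Mask residues at random, favoring those with high scores.

    Returns (tokens, labels): labels hold the original residue at masked
    positions and None elsewhere, as in masked language modeling.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    rng = random.Random(seed)
    probs = masking_probabilities(scores, mask_rate)
    tokens, labels = [], []
    for residue, p in zip(sequence, probs):
        if rng.random() < p:
            tokens.append(MASK_TOKEN)
            labels.append(residue)  # the model must recover this residue
        else:
            tokens.append(residue)
            labels.append(None)
    return tokens, labels


if __name__ == "__main__":
    seq = "MKTAYIAKQR"
    # Hypothetical scores: higher means more likely to sit at an interface.
    toy_scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.1, 0.7, 0.3, 0.9]
    masked, targets = mask_sequence(seq, toy_scores, mask_rate=0.3, seed=42)
    print(masked)
    print(targets)
```

Compared with uniform masking, this weighting makes the model spend more of its training signal on reconstructing the residues of interest, which is the intuition behind focusing the masking on likely interaction sites.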
FusOn-pLM’s embeddings were also evaluated and outperformed those of other models in predicting the behavior and properties of fusion oncoproteins. They excel at predicting a fusion oncoprotein’s propensity to form puncta, identifying its localization within the cell, and recognizing intrinsically disordered regions along with their physicochemical properties.
In conclusion, FusOn-pLM represents a significant advance for biologics aimed at cancers driven by fusion proteins. Unlike general-purpose protein language models, FusOn-pLM excels at tasks specific to fusion oncoproteins and effectively distinguishes these proteins from their individual components. Looking ahead, the researchers hope to use FusOn-pLM to design targeted protein degraders and to incorporate post-translational modifications for more accurate therapeutic interventions. The model has been made publicly available for further research and application.