Researchers from IBM Research Europe, the Institute of Computational Life Sciences at Zürich University of Applied Sciences, and Yale School of Medicine have evaluated the progress of computational models that predict T cell receptor (TCR) binding specificity, identifying room for improvement that could benefit immunotherapy development.
TCR binding specificity is key to the adaptive immune system. T cells help to manage targeted immune responses through distinctive TCRs that identify antigens derived from pathogens or diseased cells. TCR diversity is generated by random DNA rearrangement of V, D, and J gene segments, which allows the detection of a wide range of antigens.
Although the theoretical diversity of TCRs is extremely high, the repertoire actually present in any one individual is much smaller. TCRs bind peptides presented by major histocompatibility complex molecules (peptide-MHC, or pMHC), and some TCRs recognise multiple pMHC complexes. The computational models under review aim to predict this interaction more reliably, which could open new avenues in immunotherapy development.
The study examines early unsupervised clustering techniques, supervised machine learning models, and the transformative impact of protein language models (PLMs) on bioinformatics, specifically on TCR specificity analysis. The review highlights dataset biases, poor extrapolation, and gaps in model validation, and underlines the importance of improving model interpretability and extracting biological insight from large, complex models.
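As a rough illustration of the unsupervised clustering idea, the sketch below groups receptor sequences whose edit distance falls under a threshold. It is a minimal sketch, not the method of any tool covered by the review: the function names, the single-linkage rule, the `max_dist` threshold, and the CDR3-like strings are all assumptions made up for the example.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cluster_sequences(seqs, max_dist=1):
    """Single-linkage clustering: sequences within max_dist edits are linked."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if edit_distance(seqs[i], seqs[j]) <= max_dist:
            parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical CDR3-like strings, invented purely for illustration.
toy_sequences = [
    "CASSLGQAYEQYF",
    "CASSLGQGYEQYF",
    "CASSPGTDTQYF",
    "CASSPGTDTQFF",
    "CAWSVGNTIYF",
]
clusters = cluster_sequences(toy_sequences, max_dist=1)
for members in clusters:
    print(members)
```

The pairwise comparison is quadratic in the number of sequences, so this form only suits small examples. The underlying assumption, that similar sequences tend to share specificity, is exactly what such clustering approaches rely on.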
Although bulk sequencing is cost-effective and permits high data output, it cannot detect paired α and β chains. Single-cell technologies can capture these pairings but are costly, so paired data are underrepresented in datasets. Moreover, most datasets cover only a limited variety of epitopes, primarily of viral origin and linked to common HLA alleles, which introduces a pronounced bias.
The introduction of PLMs has marked a significant evolution in TCR specificity prediction. The BERT-based TCR-BERT and STAPLER models, trained on extensive protein sequence datasets, have been used for TCR and antigen classification and demonstrate how PLMs can capture complex sequence interactions to achieve strong performance.
Despite these advances, challenges remain, including ambiguous terminology and limited model interpretability. Optimization and interpretability methods tailored to protein sequences will be integral to further progress in TCR specificity prediction.
Accurate TCR specificity prediction is vital for advancing immunotherapies and understanding autoimmune diseases. Current models are constrained by limited and biased data, which hinders generalization to new epitopes. Progress in machine learning has significantly improved TCR prediction models, but predicting specificity for novel epitopes remains difficult.
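One common way to probe generalization to novel epitopes is to hold out whole epitopes, so that no epitope in the test set appears in training. The sketch below shows that idea under stated assumptions: the function name `split_by_epitope`, the `test_fraction` parameter, and the toy TCR and epitope labels are hypothetical and do not come from the review.

```python
import random


def split_by_epitope(pairs, test_fraction=0.25, seed=0):
    """Split (tcr, epitope) pairs so that test epitopes never occur in training."""
    epitopes = sorted({epitope for _, epitope in pairs})
    rng = random.Random(seed)
    rng.shuffle(epitopes)

    # Hold out at least one epitope, even for very small datasets.
    n_test = max(1, round(len(epitopes) * test_fraction))
    test_epitopes = set(epitopes[:n_test])

    train = [p for p in pairs if p[1] not in test_epitopes]
    test = [p for p in pairs if p[1] in test_epitopes]
    return train, test


# Placeholder labels, invented for illustration only.
toy_pairs = [
    ("TCR_1", "EPITOPE_A"), ("TCR_2", "EPITOPE_A"),
    ("TCR_3", "EPITOPE_B"), ("TCR_4", "EPITOPE_B"),
    ("TCR_5", "EPITOPE_C"), ("TCR_6", "EPITOPE_D"),
]
train_pairs, test_pairs = split_by_epitope(toy_pairs, test_fraction=0.25, seed=0)

train_epitopes = {epitope for _, epitope in train_pairs}
test_epitopes = {epitope for _, epitope in test_pairs}
print("shared epitopes:", train_epitopes & test_epitopes)
```

A random split over individual pairs would let the same epitope appear on both sides, which tends to overstate how well a model handles epitopes it has never seen.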