Scientists at the Massachusetts Institute of Technology (MIT) have developed a computational tool that predicts mutations likely to improve a protein. The tool guides the creation of improved proteins through strategic mutations and could lead to significant advances in neuroscience research and medical applications. A common way to produce improved proteins is to introduce random mutations into a natural protein over multiple rounds until an optimized version emerges. This method has worked for many proteins, but others have proven more difficult to optimize.
The model identified mutations that could improve green fluorescent protein (GFP) and a protein from adeno-associated virus (AAV), which is used as a vehicle for gene therapy. The team trained a convolutional neural network (CNN) on experimental data pairing GFP sequences with their measured brightness. From relatively limited data, about 1,000 GFP variants, the CNN could construct a fitness landscape: a three-dimensional map that relates how much a protein differs from the original sequence to a measure of the protein's fitness.
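To make the sequence-to-fitness idea concrete, here is a minimal sketch of how a convolutional model can turn a protein sequence into a single score. It is not the MIT team's model: the weights are random and untrained, the ten-residue sequences are made up, and the names `one_hot`, `conv1d_relu`, and `predict_fitness` are hypothetical. It only shows the shape of the computation: one-hot encoding, a sliding convolution, pooling, and a linear readout.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]

def conv1d_relu(x, kernels):
    """Slide each kernel (width x 20 weights) along the sequence, then apply ReLU."""
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        row = []
        for k in kernels:
            s = sum(w * v for krow, xrow in zip(k, window) for w, v in zip(krow, xrow))
            row.append(max(0.0, s))
        out.append(row)
    return out

def predict_fitness(seq, kernels, readout):
    """Untrained forward pass: convolution, global max pooling, linear readout."""
    features = conv1d_relu(one_hot(seq), kernels)
    pooled = [max(col) for col in zip(*features)]  # one value per kernel
    return sum(w * p for w, p in zip(readout, pooled))

# Assumption: random weights stand in for parameters that would normally be
# learned from measured (sequence, brightness) pairs.
rng = random.Random(0)
n_kernels, width = 4, 3
kernels = [[[rng.uniform(-1, 1) for _ in AMINO_ACIDS] for _ in range(width)]
           for _ in range(n_kernels)]
readout = [rng.uniform(-1, 1) for _ in range(n_kernels)]

wild_type = "MKTAYIAKQR"  # made-up 10-residue sequence, not real GFP data
mutant = "MKTAYIGKQR"     # same sequence with one substitution at position 7

print(predict_fitness(wild_type, kernels, readout))
print(predict_fitness(mutant, kernels, readout))
```

In a real setting, the kernels and readout weights would be fitted to experimental data, and the predicted score for every candidate sequence is what defines the fitness landscape.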
Such a landscape has peaks that represent fitter proteins and valleys that represent less fit ones, but finding the path a protein must follow to reach a peak can be challenging, because the route may first pass through mutations that lower fitness. The team used an existing technique to smooth the landscape, retrained the CNN, and found that it reached higher fitness peaks more easily. The modified model predicted optimized GFP sequences that differed from the original protein sequence by seven amino acids.
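The effect of smoothing can be illustrated with a toy example. The article does not name the smoothing technique the team used, so this sketch substitutes simple neighbor averaging, and it skips the retraining step entirely. The fitness values are invented, and the landscape is reduced to a single hypothetical mutational path. It shows only why a greedy search stalls on a rugged landscape yet climbs past the same dip once the landscape is smoothed.

```python
def smooth(fitness, neighbors):
    """Replace each variant's fitness with the mean over itself and its neighbors."""
    return {
        v: sum(fitness[u] for u in [v, *neighbors[v]]) / (1 + len(neighbors[v]))
        for v in fitness
    }

def hill_climb(start, landscape, neighbors):
    """Greedy ascent: move to the best neighbor until no neighbor is higher."""
    current = start
    while True:
        best = max(neighbors[current], key=landscape.get)
        if landscape[best] <= landscape[current]:
            return current
        current = best

# Toy data (hypothetical): variants 0..6 lie along one mutational path, where
# variant i carries i mutations relative to the starting sequence. Variant 1 is
# less fit than the start, so a greedy search on the raw values cannot leave 0.
raw = {0: 1.0, 1: 0.8, 2: 1.6, 3: 1.4, 4: 2.5, 5: 2.0, 6: 1.2}
neighbors = {v: [u for u in (v - 1, v + 1) if u in raw] for v in raw}

smoothed = smooth(raw, neighbors)
stuck_at = hill_climb(0, raw, neighbors)      # search on the rugged landscape
reached = hill_climb(0, smoothed, neighbors)  # search on the smoothed landscape

print(stuck_at, raw[stuck_at])  # prints: 0 1.0
print(reached, raw[reached])    # prints: 4 2.5
```

The search on the smoothed values ends at variant 4, which is also the fittest variant in the raw data, while the search on the raw values never leaves the starting sequence.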
The researchers also applied the technique to the AAV viral capsid, which is often used to deliver DNA, and identified new capsid sequences. In that case, the tool optimized the capsid for its ability to package DNA.
Although GFP and AAV are well characterized and served as proofs of concept, the tool is expected to be useful for other protein engineering efforts. The researchers hope to apply the technique to data on voltage indicator proteins, which are crucial for observing neuron activity in mammalian cells without the need for electrodes.
The tool raises the hope that a model built in silico from a smaller dataset could yield better predictions than two decades of manual testing. Although many labs have worked on improving these indicators for decades, no satisfactory solution has yet been found.
The study, conducted by researchers from several MIT centers with diverse missions, was supported by several funding bodies, including the National Science Foundation, the Howard Hughes Medical Institute, the U.S. Office of Naval Research, the Machine Learning for Pharmaceutical Discovery and Synthesis consortium, and the DARPA Accelerated Molecular Discovery program.