
MIT researchers have developed a computational model that helps predict mutations leading to better proteins, based on a relatively small dataset. In the current process of creating proteins with useful functions, scientists usually start with a natural protein and put it through numerous rounds of random mutation to generate an optimized version.

This process has led to optimized variants of many crucial proteins, including the green fluorescent protein (GFP). However, for some proteins, generating an optimized version has been challenging. The researchers believe their computational approach could also be used to develop additional neuroscience research tools and other medical applications.

Protein design is difficult because the mapping from DNA sequence to protein structure and function is complex. For example, researchers would like to engineer proteins for use in mammalian cells that can measure neuron activity without electrodes, but optimizing such proteins has proven challenging.

In this study, the researchers overcame these challenges by developing and testing a computational model that used data from GFP to predict better versions of the protein. To begin, the model was trained on experimental data consisting of GFP sequences and their brightness – the feature they wished to optimize.
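To make the training setup concrete, the sketch below shows one common way to turn sequence and brightness measurements into numeric training pairs: one-hot encoding each amino acid. This is an illustration under assumptions, not the researchers' actual pipeline; the function names and the toy peptides are hypothetical and are not real GFP variants.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Flatten a protein sequence into a 0/1 vector of length 20 * len(sequence)."""
    vector = [0] * (len(AMINO_ACIDS) * len(sequence))
    for position, aa in enumerate(sequence):
        # Each position owns a block of 20 slots; set the slot for its residue.
        vector[position * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1
    return vector


def build_training_set(measurements):
    """Split (sequence, brightness) pairs into feature vectors and targets."""
    features = [one_hot(seq) for seq, _ in measurements]
    targets = [brightness for _, brightness in measurements]
    return features, targets


# Hypothetical toy data: short made-up peptides with made-up brightness values.
toy_measurements = [("MSKG", 1.00), ("MSRG", 1.20), ("MAKG", 0.40)]
X, y = build_training_set(toy_measurements)
```

A regression model of any kind could then be fit on `X` and `y` to predict brightness for sequences that were never measured.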

The model was able to construct a “fitness landscape”: a three-dimensional map showing the fitness of a given protein relative to the original sequence. The landscape was generated from a relatively small amount of experimental data, covering approximately 1,000 GFP variants. However, predicting the path a protein needs to follow to reach the peaks of fitness can be difficult.

In response, the researchers used an existing computational technique to “smooth” the fitness landscape. Once the landscape was smoothed, they retrained the convolutional neural network (CNN) model and found that it could reach greater fitness peaks more easily. The model was also able to predict optimized GFP sequences differing by as many as seven amino acids from the protein sequence they began with.
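The article does not name the smoothing technique, so the following is only a minimal sketch of the general idea, using simple neighbor averaging as a stand-in: each sequence's fitness is blended with the mean fitness of sequences one mutation away. The function names, the `weight` parameter, and the toy values are all assumptions made for illustration.

```python
def hamming(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def smooth_landscape(fitness, weight=0.5, rounds=1):
    """Blend each fitness value with the mean of its one-mutation neighbors.

    fitness: dict mapping sequence -> measured fitness.
    weight:  how much of the neighbor mean to mix in (0 = no smoothing).
    """
    sequences = list(fitness)
    neighbors = {
        s: [t for t in sequences if hamming(s, t) == 1] for s in sequences
    }
    current = dict(fitness)
    for _ in range(rounds):
        updated = {}
        for s in sequences:
            if neighbors[s]:
                mean = sum(current[t] for t in neighbors[s]) / len(neighbors[s])
                updated[s] = (1 - weight) * current[s] + weight * mean
            else:
                updated[s] = current[s]  # isolated sequences are left unchanged
        current = updated
    return current


# Hypothetical toy landscape: "AAC" is a noisy dip on the path from AAA to CCC.
noisy = {"AAA": 1.0, "AAC": 0.2, "ACC": 1.4, "CCC": 2.0}
smoothed = smooth_landscape(noisy)
```

In this toy case the raw values dip at "AAC", so a search that only accepts improving mutations would stall at "AAA". After one round of averaging, fitness rises at every step along the path AAA, AAC, ACC, CCC, which mirrors why a smoothed landscape can be easier to climb.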

The researchers also demonstrated that this approach was successful in identifying new sequences for the viral capsid of the adeno-associated virus (AAV), a viral vector frequently used to deliver DNA. In this case, they optimized the capsid for its ability to package a DNA payload.

The research team now plans to apply this computational technique to data generated on voltage indicator proteins, which have been a focus of research for decades. The hope is that a smaller dataset could be used to train a model in silico and make predictions that outperform the results of the past two decades of manual testing.
