The study of evolution by natural selection at the molecular level has progressed remarkably with the advent of genomic technologies. Traditionally, researchers focused on observable traits; gene expression, however, offers deeper insight into selection pressures and bridges the gap between genomic data and macro-level traits. A recent study used RNA sequencing to analyze gene expression in the ivyleaf morning glory (Ipomoea hederacea) under natural conditions. The study used machine learning to handle the high-dimensional, small-sample-size data that characterize transcriptomics, where measured genes far outnumber sampled individuals. The analysis indicated that genes linked to photosynthesis, stress response, and light response were important for predicting fitness. The approach showed that machine learning models can identify genes under selection and key biological processes in a natural environment, overcoming limitations of conventional statistical methods.
Beyond gene expression, codon usage also plays a significant role in evolution, as it varies considerably across and within species. Researchers used artificial intelligence (AI) models, specifically the transformer-based mBART model, to predict codon sequences from given amino acid sequences in several organisms. The results showed that AI can accurately learn and predict codon patterns, especially in highly expressed genes and longer proteins. This indicates that codon choice is influenced by evolutionary pressures linked to protein expression and folding. The work sheds light on codon bias and its effect on protein synthesis, and it offers valuable tools for applications in biotechnology and synthetic biology.
The codon study drew on National Center for Biotechnology Information (NCBI) coding sequences from various organisms, which were divided into different sets. Codon prediction models, including frequency-based methods and mBART models, were evaluated using accuracy and perplexity. The mBART models used both masking and mimicking techniques to predict codon sequences.
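As a rough illustration of a frequency-based baseline and of the two evaluation metrics, the following Python sketch predicts, for each amino acid, the codon seen most often in the training sequences, then scores a held-out sequence by accuracy and perplexity. The toy sequences, the partial codon table, the function names, and the probability floor for unseen codons are all assumptions made for illustration; the study's actual baseline, data, and splits may differ.

```python
import math
from collections import Counter, defaultdict

# Partial standard genetic code (codon -> amino acid), just enough for the toy data.
CODON_TO_AA = {
    "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E",
}

def split_codons(cds):
    """Split a coding sequence into consecutive 3-nucleotide codons."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]

def fit_codon_frequencies(training_cds):
    """Estimate P(codon | amino acid) from training coding sequences."""
    counts = defaultdict(Counter)
    for cds in training_cds:
        for codon in split_codons(cds):
            counts[CODON_TO_AA[codon]][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }

def predict_codons(protein, freqs):
    """Pick the most frequent training codon for each amino acid."""
    return [max(freqs[aa], key=freqs[aa].get) for aa in protein]

def accuracy(predicted, actual):
    """Fraction of positions where the predicted codon matches the true one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def perplexity(actual, freqs, floor=1e-9):
    """exp of the mean negative log-probability of the true codons.

    `floor` is an assumed fallback probability for codons never seen in training.
    """
    log_prob = 0.0
    for codon in actual:
        aa = CODON_TO_AA[codon]
        log_prob += math.log(freqs.get(aa, {}).get(codon, floor))
    return math.exp(-log_prob / len(actual))

# Hypothetical toy data: three training sequences and one held-out sequence.
train_cds = ["ATGGCTAAAGAAGCT", "ATGGCCAAAGAAGCT", "ATGGCTAAGGAGGCA"]
held_out = split_codons("ATGGCTAAGGAA")

freqs = fit_codon_frequencies(train_cds)
protein = "".join(CODON_TO_AA[c] for c in held_out)
predicted = predict_codons(protein, freqs)

print("accuracy:", accuracy(predicted, held_out))
print("perplexity:", perplexity(held_out, freqs))
```

Accuracy only rewards the single top choice. Perplexity also reflects how much probability the model gave to the true codon, so a lower value means the model was less surprised by the real sequence.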
For the field experiment, tissue was collected from ivyleaf morning glory plants. The experiment employed a range of analytical and data-processing methods, from soil sample analysis to mRNA extraction and sequencing, and used neural networks and gradient tree boosting for supervised modeling. Together, the two studies revealed important correlations between protein expression, codon usage, and functional attributes: highly expressed genes and well-conserved proteins had more predictable codon patterns. Machine learning approaches also efficiently identified gene expression patterns linked to fitness, especially in genes associated with stress response and reproductive development. These results underscore the effectiveness of AI in decoding complex biological sequences and in improving our understanding of molecular evolution and gene regulation.
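To make the idea of gradient tree boosting on expression data more concrete, here is a minimal pure-Python sketch that boosts depth-one regression trees (stumps) on squared error and ranks genes by the error reduction their splits achieve. It is not the study's pipeline: the expression matrix, the fitness values, the hyperparameters, and the gain-based importance score are all hypothetical choices made for illustration.

```python
def fit_stump(X, residuals):
    """Find the single (gene, threshold) split that most reduces squared error."""
    n = len(X)
    total = sum(residuals)
    best = None  # (gain, gene, threshold, left_mean, right_mean)
    for gene in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][gene])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][gene], X[order[k]][gene]
            if lo == hi:
                continue  # cannot split between identical expression values
            right_sum = total - left_sum
            # Reduction in the sum of squared errors from splitting here.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, gene, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best

def fit_boosting(X, y, n_rounds=20, learning_rate=0.3):
    """Fit stumps sequentially to the residuals of the current predictions."""
    baseline = sum(y) / len(y)
    predictions = [baseline] * len(y)
    stumps = []
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        best = fit_stump(X, residuals)
        if best is None:
            break  # no gene offers a valid split
        gain, gene, threshold, left_value, right_value = best
        stumps.append((gene, threshold, left_value, right_value))
        importance[gene] += gain
        for i, row in enumerate(X):
            step = left_value if row[gene] <= threshold else right_value
            predictions[i] += learning_rate * step
    return {"baseline": baseline, "stumps": stumps,
            "learning_rate": learning_rate, "importance": importance}

def predict(model, row):
    """Predict fitness for one plant from its expression values."""
    value = model["baseline"]
    for gene, threshold, left_value, right_value in model["stumps"]:
        step = left_value if row[gene] <= threshold else right_value
        value += model["learning_rate"] * step
    return value

# Hypothetical data: 6 plants x 3 genes; fitness roughly tracks gene 0.
expression = [
    [1.0, 5.0, 2.0],
    [2.0, 3.0, 2.5],
    [3.0, 4.5, 1.5],
    [4.0, 3.5, 2.2],
    [5.0, 5.5, 1.8],
    [6.0, 4.0, 2.1],
]
fitness = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1]

model = fit_boosting(expression, fitness)
ranked = sorted(range(len(expression[0])),
                key=lambda g: model["importance"][g], reverse=True)
print("genes ranked by importance:", ranked)
```

In real transcriptomic data, with far more genes than plants, a model like this can easily overfit, so held-out evaluation such as cross-validation would be needed before trusting any importance ranking.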