Genomic research, which seeks to understand the structure and function of genomes, plays a significant role in a variety of sectors, including medicine, biotechnology, and evolutionary biology. It provides valuable insights into potential therapies for genetic disorders and fundamental life processes. However, the field also faces major challenges, particularly when it comes to modelling and interpreting complex biological sequences. Traditional genomic modelling approaches typically focus on individual elements, such as proteins or regulatory DNA, and often struggle with intricate, multi-scale interactions.
To address these difficulties, a team of researchers from Stanford University, Arc Institute, TogetherAI, CZ Biohub, and the University of California, Berkeley, recently introduced Evo, a new genomic foundation model. Evo is designed to perform prediction and generation tasks at scales ranging from individual molecules to entire genomes. Built on a deep signal processing architecture, it can process large genomic datasets with a high degree of accuracy and detail.
Evo’s architecture, dubbed StripedHyena, combines attention mechanisms and convolutional operators to process long genomic sequences efficiently while maintaining single-nucleotide resolution. That resolution is critical for capturing fine-grained variation in genetic sequences, where a change to a single base can matter. Evo was trained on an extensive dataset of 300 billion nucleotide tokens drawn from whole prokaryotic genomes. Training at this scale allows the model to learn the intricate patterns of genomic sequences and to perform prediction and generation tasks across different molecular modalities.
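To make "single-nucleotide resolution" concrete, the following minimal sketch contrasts per-base tokenization with a coarser k-mer scheme. It is an illustration only: the `VOCAB` mapping and both helper functions are hypothetical and are not taken from Evo's codebase.

```python
# Hypothetical vocabulary: one integer id per nucleotide.
# This is an illustrative assumption, not Evo's actual tokenizer.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def tokenize_nucleotides(seq: str) -> list[int]:
    """Map each base to one token id, so token count equals sequence length."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def tokenize_kmers(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mer tokens, shown for contrast.

    Any trailing bases that do not fill a full k-mer form a shorter final token.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


reference = "ATGGCCAAGT"
variant = "ATGGCTAAGT"  # single substitution at index 5 (C -> T)

ref_ids = tokenize_nucleotides(reference)
var_ids = tokenize_nucleotides(variant)

# With one token per base, the substitution is localized to exactly one token.
changed_positions = [
    i for i, (a, b) in enumerate(zip(ref_ids, var_ids)) if a != b
]

# With k-mers, the same substitution changes a whole 3-base token.
ref_kmers = tokenize_kmers(reference)
var_kmers = tokenize_kmers(variant)
changed_kmers = [
    i for i, (a, b) in enumerate(zip(ref_kmers, var_kmers)) if a != b
]
```

With per-base tokens, a point mutation maps to exactly one changed token at the mutated position. Under the k-mer scheme the model only sees that a three-base unit differs. The trade-off is sequence length: per-base tokenization produces one token per nucleotide, which is why efficient long-context processing matters for genome-scale inputs.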
The model performs strongly on zero-shot function prediction and on generation tasks. It has been shown to generate synthetic CRISPR-Cas molecular complexes and transposable systems. It can also predict gene essentiality with high accuracy and generate coding-rich sequences up to 650 kilobases in length. Additionally, Evo outperforms existing domain-specific language models in several areas, demonstrating its capabilities across a range of genomic tasks.
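One common way to obtain zero-shot predictions from a sequence language model is to compare the likelihood the model assigns to a mutated sequence with the likelihood it assigns to the original. The article does not describe Evo's scoring procedure, so the sketch below only illustrates that general idea. A toy order-2 Markov model stands in for a real genomic model, and `ToyMarkovModel` and `zero_shot_score` are hypothetical names.

```python
import math
from collections import defaultdict


class ToyMarkovModel:
    """Order-k Markov model over nucleotides.

    A deliberately tiny stand-in for a genomic language model; it is not Evo.
    """

    def __init__(self, order: int = 2, alphabet: str = "ACGT",
                 pseudocount: float = 1.0):
        self.order = order
        self.alphabet = alphabet
        self.pseudocount = pseudocount
        # counts[context][next_base] -> number of observations
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        """Count which base follows each length-k context."""
        for seq in sequences:
            for i in range(self.order, len(seq)):
                context = seq[i - self.order:i]
                self.counts[context][seq[i]] += 1.0
        return self

    def log_likelihood(self, seq: str) -> float:
        """Sum of log-probabilities of each base given its preceding context."""
        total = 0.0
        for i in range(self.order, len(seq)):
            context = seq[i - self.order:i]
            ctx_counts = self.counts.get(context, {})
            # Additive smoothing so unseen transitions get non-zero probability.
            numerator = ctx_counts.get(seq[i], 0.0) + self.pseudocount
            denominator = (sum(ctx_counts.values())
                           + self.pseudocount * len(self.alphabet))
            total += math.log(numerator / denominator)
        return total


def zero_shot_score(model, wild_type: str, mutant: str) -> float:
    """Log-likelihood ratio; negative means the model finds the mutant less plausible."""
    return model.log_likelihood(mutant) - model.log_likelihood(wild_type)


# Toy "training genome": a short motif repeated three times (made-up data).
model = ToyMarkovModel(order=2).fit(["ATGGCCAAG" * 3])

wild_type = "ATGGCCAAG"
mutant = "ATGGCTAAG"  # C -> T substitution at index 5

score = zero_shot_score(model, wild_type, mutant)
```

No task-specific training is involved here: the score comes entirely from what the model learned about sequence statistics, which is what "zero-shot" means in this setting. A real genomic model would replace the Markov counts with learned next-token probabilities, but the comparison between mutant and reference would have the same shape.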
In the broader context, Evo offers clear advantages over previous models in genomic research. Its ability to perform both analysis and generation at genome scale marks a substantial step forward for the field. Its success in modelling genomic data at large scale, together with its ability to predict and generate complex biological sequences, suggests new possibilities for biological research and synthetic biology.