Microbial sequence databases hold a wealth of information about enzymes and other molecules that could be adapted for biotechnology. However, these databases have grown so large that searching them efficiently for specific enzymes of interest has become difficult.
Researchers from the McGovern Institute for Brain Research at MIT, the Broad Institute of MIT and Harvard, and the National Center for Biotechnology Information (NCBI) at the National Institutes of Health have developed a search algorithm that identified 188 new kinds of rare CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) systems hidden within bacterial genomes. The findings, which encompass thousands of individual systems, were recently published in the journal Science.
The algorithm, Fast Locality-Sensitive Hashing-based clustering (FLSHclust), uses big-data clustering techniques to search large volumes of genomic data quickly. Using it, the team mined three major public databases containing data from a wide range of unusual bacteria, including those found in environments as diverse as coal mines, breweries, Antarctic lakes and dog saliva.
The researchers discovered a surprising number and diversity of CRISPR systems, including some that can edit DNA, others that can target RNA, and many with a variety of other functions. These new systems could potentially be used to edit mammalian cells with fewer off-target effects than current Cas9 systems. They could also prove useful in diagnostics or serve as molecular records of activity within cells.
The search revealed an unprecedented level of diversity and flexibility among CRISPR systems, leading the researchers to believe that many more rare systems are likely to be discovered as the databases continue to grow. Professor Feng Zhang, a co-senior author of the study and a pioneering CRISPR researcher, noted the need for better tools like FLSHclust to search molecular sequences and find promising discoveries.
To discover novel CRISPR systems within protein and nucleic acid sequence databases, the team created an algorithm based on a technique called locality-sensitive hashing, commonly used in the big data community. This technique clusters together objects that are similar but not identical. Using this approach, the team was able to analyze billions of protein and DNA sequences in just weeks, a task that would have taken months using previous methods that searched for identical objects.
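To make the idea of grouping "similar but not identical" objects concrete, here is a minimal sketch of one common locality-sensitive hashing scheme, MinHash with banding, applied to short sequences. It is an illustration only and not the FLSHclust implementation: the function names, the k-mer size, and the hash and band counts are all assumptions chosen for readability.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping k-mers (shingles) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, num_hashes=32, k=3):
    """MinHash: for each seeded hash function, keep the smallest k-mer hash.

    Assumes len(seq) >= k so that the k-mer set is not empty.
    """
    shingles = kmers(seq, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # the seed acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode(), digest_size=8).digest(), "big"
            )
            for s in shingles
        ))
    return signature


def lsh_clusters(seqs, num_hashes=32, bands=16, k=3):
    """Cluster sequences that share at least one identical signature band.

    `seqs` maps a (hypothetical) sequence name to its sequence string.
    """
    rows = num_hashes // bands
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Sequences whose signatures agree on a whole band land in the same bucket.
    buckets = defaultdict(list)
    for name, seq in seqs.items():
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(name)

    # Merge every bucket into a single cluster.
    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    clusters = defaultdict(set)
    for name in seqs:
        clusters[find(name)].add(name)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    example = {
        "protein_a": "MKTAYIAKQRQISFVKSHFSRQ",
        "protein_a_variant": "MKTAYIAKQRQISFVKSHFSRA",  # one residue changed
        "unrelated": "GGGGGGGGGGGGPPPPPPPPPP",
    }
    print(lsh_clusters(example))
```

In this sketch, sequences with heavily overlapping k-mer sets are very likely to share a bucket and so end up in the same cluster, while sequences with no k-mers in common do not. Because only sequences in the same bucket are ever compared, the approach avoids comparing every sequence against every other one, which is what makes this family of techniques attractive for very large databases.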
Notably, the researchers found several new variants of Type I CRISPR systems, which use a guide RNA that is 32 nucleotides long rather than the 20-nucleotide guide of Cas9. Because of their longer guide RNAs, these Type I systems could potentially be used to develop more precise gene-editing technology that is less prone to off-target editing.
The team believes the new search algorithm could significantly aid the search for other biochemical systems. The findings underscore the immense diversity of CRISPR systems, most of which are rare and found only in unusual bacteria, and they highlight the importance of sampling a broader range of organisms to keep expanding what can be discovered.