Scientists from the McGovern Institute for Brain Research at MIT, the Broad Institute of MIT and Harvard, and the National Center for Biotechnology Information at the National Institutes of Health have developed a new search algorithm to find enzymes of interest in vast microbial sequence databases. Using this algorithm, called Fast Locality-Sensitive Hashing-based clustering (FLSHclust), they discovered 188 new rare CRISPR systems in bacterial genomes from a wide range of environments. The systems showed substantial potential for gene editing, with diverse applications ranging from making DNA edits in human cells to diagnostics or cellular activity recording.
The FLSHclust algorithm leverages big-data clustering to swiftly search huge volumes of genomic data. The scientists applied the technique to inspect three major public databases, finding data from unusual bacteria that inhabit environments as diverse as coal mines, breweries, and dog saliva. The researchers unearthed surprising diversity within the CRISPR systems, which could allow more precision and fewer off-target effects than the existing Cas9-based solutions.
Among the new systems, the researchers singled out several variants of Type I CRISPR systems that use a guide RNA 32 base pairs long instead of the 20-nucleotide guide in Cas9. Because a longer guide must match a longer target sequence, these systems could lead to more accurate gene-editing technology that is less likely to stray off-target. The new systems could also serve diagnostic applications or act as molecular records of cellular activities.
The FLSHclust algorithm, based on the big data method of locality-sensitive hashing, groups together similar but not identical objects, allowing efficient examination of billions of protein and DNA sequences within a much shorter time frame than other strategies.
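The general idea of locality-sensitive hashing can be illustrated with a small sketch. The code below is not the FLSHclust implementation. It assumes a common textbook variant, MinHash over short subsequences (k-mers) with banding, and every function name and parameter in it is hypothetical. Sequences whose signatures agree in at least one band land in the same bucket and become candidate pairs, so only those pairs need a full comparison.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of length-k substrings of seq (the whole seq if it is shorter than k)."""
    if len(seq) < k:
        return {seq}
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, kmer):
    """Deterministic 64-bit hash of a k-mer, salted with a seed (illustrative choice)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, num_hashes=32, k=3):
    """For each seeded hash function, keep the minimum hash over the sequence's k-mers."""
    parts = kmers(seq, k)
    return tuple(min(_hash(seed, p) for p in parts) for seed in range(num_hashes))


def lsh_candidate_pairs(seqs, num_hashes=32, bands=8, k=3):
    """Bucket sequences by bands of their signature; report pairs sharing any bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, seq in seqs.items():
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)

    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


if __name__ == "__main__":
    # Toy protein-like strings, invented for illustration only.
    example = {
        "seq_a": "MKTAYIAKQRQISFVKSHFSRQ",
        "seq_b": "MKTAYIAKQRQISFVKSHFSRQ",
        "seq_c": "GGSWLLPPNNHHDDEECCVVTT",
    }
    print(lsh_candidate_pairs(example))
```

In this scheme identical sequences always share every bucket, and the chance that two different sequences share at least one bucket rises with the overlap between their k-mer sets. Each sequence is hashed once rather than compared against every other sequence, which is what makes this family of methods attractive for very large collections.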
Most of the discovered CRISPR systems, whether they fell into novel or existing categories, were found in unusual bacteria, emphasizing the importance of biodiversity for this field of study. According to the researchers, the algorithm could also contribute to other research involving large databases, such as discovering new genes or studying protein evolution. The project received generous support from various donors, foundations, and institutes.