Harnessing high-dimensional clinical data (HDCD), meaning health care datasets with far more variables than patients, for genetic discovery and disease prediction poses a considerable challenge. Because the data space grows rapidly with each added variable, analyzing and processing HDCD demands immense computational resources. The same high dimensionality makes models built on this data harder to interpret, which can hinder clinical decisions. Two further obstacles limit the efficient use of HDCD in genomic studies: traditional disease labels cannot reflect complex biological traits, and large datasets with comprehensive disease labels are difficult to assemble.
To address this problem, Google AI researchers have developed a method called REpresentation Learning for Genetic discovery on Low-dimensional Embeddings (REGLE). The approach uses HDCD, including spirograms, photoplethysmograms (PPGs), and imaging data, for genetic discovery and disease prediction. Current genomic studies lean heavily on genome-wide association studies (GWAS) run on expert-defined features extracted from HDCD. These methods face high computational costs, a heavy multiple-testing burden, and a limited ability to capture complex genetic associations.
REGLE employs unsupervised representation learning to transform HDCD into lower-dimensional embeddings without requiring disease labels. Using a variational autoencoder (VAE), it learns non-linear, low-dimensional, disentangled representations of HDCD. The approach can also incorporate expert-defined features (EDFs) when they are available, which supports more efficient and thorough genetic analyses. REGLE has three main steps: learning HDCD embeddings with a VAE, running GWAS on these embeddings to identify genetic associations, and building polygenic risk scores (PRSs) from the embeddings to predict specific traits or diseases.
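To make the first step more concrete, the following is a minimal sketch of the per-sample objective a VAE typically minimizes: a reconstruction error plus a KL penalty that pulls the latent distribution toward a standard normal prior. It is an illustration under stated assumptions, not REGLE's actual implementation. The squared-error reconstruction term, the `beta` weight, and all function names are assumptions chosen for the example, and the encoder and decoder networks are left out entirely (their outputs are passed in as plain lists).

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reconstruction_error(x, x_hat):
    """Sum of squared errors between an input signal and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Per-sample loss: reconstruction error plus a beta-weighted KL penalty.

    beta is a hypothetical knob for this sketch; beta = 1 gives the usual
    VAE objective (up to constants) under a Gaussian decoder assumption.
    """
    return reconstruction_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


# Toy values standing in for encoder and decoder outputs (not real data).
rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, -1.0]
z = reparameterize(mu, log_var, rng)   # a 2-dimensional embedding sample
x = [1.0, 0.8, 0.3]                    # toy input signal
x_hat = [0.9, 0.7, 0.4]                # toy reconstruction
loss = vae_loss(x, x_hat, mu, log_var)
```

In a pipeline of this kind, the low-dimensional coordinates produced by the encoder, rather than the raw signal, would serve as the phenotypes for the downstream association step.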
REGLE showed substantial improvements when it was evaluated on two HDCD types, spirograms and PPGs. It detected genetic loci related to cardiovascular and lung function that traditional methods had not identified. It uncovered 45% more significant loci for PPG data and improved risk prediction for asthma and COPD compared with methods that rely on EDFs or principal component analysis (PCA). The method also produced more interpretable results, capturing features such as airway obstruction that standard EDFs generally underrepresent.
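The association and scoring steps behind results of this kind can be illustrated with a toy example. The sketch below regresses one embedding coordinate on the genotype dosage of a single variant, then computes a polygenic score as a weighted sum of dosages. It is a simplified illustration, not the authors' pipeline: the function names and all numbers are made up for the example, and a real GWAS would also adjust for covariates such as age, sex, and ancestry and apply a genome-wide significance threshold.

```python
import math


def linear_association(dosages, phenotype):
    """Least-squares fit of phenotype = intercept + slope * dosage.

    dosages holds allele counts (0, 1, or 2) for one variant, and phenotype
    holds one embedding coordinate per individual. Returns (slope, t_statistic).
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    # Residual sum of squares, used to estimate the standard error of the slope.
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(dosages, phenotype))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


def polygenic_score(dosages_by_variant, weights):
    """Weighted sum of one individual's dosages across variants."""
    return sum(w * g for w, g in zip(weights, dosages_by_variant))


# Toy data for six individuals (illustrative values only).
dosages = [0, 1, 2, 0, 1, 2]
embedding_coord = [0.1, 0.9, 2.1, -0.1, 1.1, 1.9]
slope, t_stat = linear_association(dosages, embedding_coord)

# Hypothetical per-variant effect sizes applied to one individual's genotypes.
score = polygenic_score([2, 0, 1], [0.5, -0.2, 0.1])
```

Repeating the regression for every variant and every embedding coordinate yields the per-coordinate association results, and the estimated effect sizes can then serve as the weights in the scoring function.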
As a state-of-the-art approach to genetic studies on HDCD, REGLE uses unsupervised learning to surface hidden genetic signals and improve disease prediction. Because it does not require extensive disease labels and can incorporate expert features, it overcomes key constraints of traditional methods. Its gains in risk prediction and its discovery of new loci underscore REGLE's potential to advance genomic research and personalized medicine through more comprehensive HDCD analysis.