The evolution of machine learning algorithms has fueled speculation about job displacement, as AI now outperforms human experts in some domains. Nevertheless, it is often claimed that humans will remain vital, especially for tasks that require learning from few examples, such as identifying rare diseases in diagnostic radiology or handling unusual scenarios in self-driving cars. To test these assertions, a team of researchers from MIT and Harvard Medical School compared the performance of human radiologists, CheXpert (a supervised learning algorithm), and CheXzero (a zero-shot learning algorithm) in diagnosing rare diseases.
CheXzero, trained on the MIMIC-CXR dataset, uses contrastive learning to predict multiple pathologies without pathology-specific labels, while CheXpert, trained on explicitly labeled Stanford radiographs, diagnoses twelve specific illnesses. The comparison involved 227 radiologists assessing 324 cases from Stanford, with the concordance statistic (equivalent to the area under the ROC curve) as the performance metric.
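To make the zero-shot mechanism concrete, here is a minimal sketch of the CLIP-style scoring that contrastive models like CheXzero build on: an image embedding is compared against text embeddings of a positive and a negative phrasing of the same finding, and the softmax over the two similarities serves as the prediction. The embeddings and prompt wording below are made up for illustration, not CheXzero's actual API.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def zero_shot_score(image_emb, pos_emb, neg_emb):
    """CLIP-style zero-shot prediction: softmax over the image's similarity
    to a positive vs. a negative text prompt for the same finding."""
    sims = np.array([cosine(image_emb, pos_emb), cosine(image_emb, neg_emb)])
    probs = np.exp(sims) / np.exp(sims).sum()
    return probs[0]  # probability mass on the positive prompt

# Hypothetical 512-d embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
pos_emb = rng.normal(size=512)   # e.g. text encoding of "pneumothorax"
neg_emb = rng.normal(size=512)   # e.g. text encoding of "no pneumothorax"
print(f"P(finding present) ~ {zero_shot_score(image_emb, pos_emb, neg_emb):.3f}")
```

Because the text prompt can name any pathology, this recipe extends to conditions the model never saw an explicit label for, which is what lets a zero-shot system cover far more diagnoses than a fixed-label classifier.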
The results showed that average pathology prevalence was very low, around 2.42%, although a few pathologies exceeded 15%. CheXpert slightly outperformed both the human radiologists and CheXzero, while the radiologists averaged a concordance of 0.58, below both AI algorithms. Human and CheXzero performance were only weakly correlated, suggesting the two attend to different features when reading X-rays.
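For reference, the concordance statistic is the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one, so 0.5 is chance and 1.0 is perfect ranking. A minimal sketch with toy data:

```python
import numpy as np

def concordance(labels, scores):
    """C-statistic: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Equivalent to the area under the ROC curve."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Brute-force pairwise comparison; fine for small n,
    # use a rank-based formula at scale.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: scores that mostly rank positives above negatives.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1]
print(f"c-statistic = {concordance(labels, scores):.2f}")  # ~0.92
```

A rank-based metric like this is well suited to rare diseases, since it measures how well cases are ordered rather than depending on a decision threshold that is hard to set when positives are scarce.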
The study also found that CheXzero covered 79 pathologies, far more than CheXpert, illustrating that most clinically relevant pathologies fall outside the label sets of the supervised algorithms examined. CheXpert improved markedly on higher-prevalence pathologies, while CheXzero outperformed the human readers across every prevalence category.
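A hedged sketch of what such a prevalence-stratified comparison looks like in practice: per-pathology concordance values are binned by how common the pathology is, then averaged per reader. The numbers and bin edges below are invented for illustration, not the study's data.

```python
import numpy as np

# Hypothetical per-pathology results: (prevalence, concordance) pairs.
# A real analysis would compute these from case-level labels and scores.
results = {
    "model":        [(0.005, 0.74), (0.02, 0.78), (0.08, 0.83), (0.18, 0.88)],
    "radiologists": [(0.005, 0.55), (0.02, 0.57), (0.08, 0.60), (0.18, 0.64)],
}
bins = [0.0, 0.01, 0.05, 0.10, 1.0]  # illustrative prevalence cut points

for reader, rows in results.items():
    prev = np.array([p for p, _ in rows])
    auc = np.array([c for _, c in rows])
    idx = np.digitize(prev, bins) - 1  # assign each pathology to a bin
    for b in range(len(bins) - 1):
        mask = idx == b
        if mask.any():
            print(f"{reader:>12} | prevalence {bins[b]:.0%}-{bins[b+1]:.0%}: "
                  f"mean c-stat {auc[mask].mean():.2f}")
```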
While the study shows that self-supervised algorithms are rapidly closing the gap with human experts at identifying rare diseases, it also cautions that converting algorithm outputs into diagnostic decisions remains difficult, particularly for rare conditions. These algorithms are therefore more likely to collaborate with human diagnosticians than to supplant them. The findings also underscore the critical problems that must still be solved for such algorithms to be deployed successfully.