Large language models (LLMs) like GPT-4 have demonstrated impressive performance on a wide range of tasks, from summarizing news articles to writing code. However, two crucial issues temper this enthusiasm: hallucination and performance disparities. Hallucination is the tendency of LLMs to generate plausible yet inaccurate text, which poses a risk in tasks that require accurate factual recall. Performance disparities refer to inconsistency in LLM reliability across subsets of inputs, often tied to sensitive attributes such as race or ethnicity, gender, or language.
Researchers from the University of Maryland and Michigan State University developed a benchmark named WorldBench to explore potential geographical disparities in the factual-recall capabilities of LLMs. The benchmark uses 11 diverse indicators spanning nearly 200 countries and was applied to 20 state-of-the-art LLMs released in 2023, including both open-source and commercial models.
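Because WorldBench draws its ground truth from the World Bank, the underlying data is publicly retrievable. The sketch below shows one way to pull an indicator value from the World Bank's public REST API; it is illustrative only. The indicator code SP.POP.TOTL (total population) is a real World Bank code, but the specific indicators and retrieval logic WorldBench uses are assumptions here, not the authors' actual pipeline.

```python
import requests

WB_API = "https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"

def fetch_indicator(country: str, indicator: str, year: int) -> float | None:
    """Fetch one World Bank indicator value for a country and year.

    Illustrative sketch; WorldBench's actual data pipeline may differ.
    """
    resp = requests.get(
        WB_API.format(country=country, indicator=indicator),
        params={"format": "json", "date": str(year)},
        timeout=30,
    )
    resp.raise_for_status()
    payload = resp.json()
    # The API returns [metadata, records]; records may be missing or null-valued.
    records = payload[1] if len(payload) > 1 and payload[1] else []
    for record in records:
        if record.get("value") is not None:
            return float(record["value"])
    return None

# Example: total population of Kenya in 2021.
print(fetch_indicator("KEN", "SP.POP.TOTL", 2021))
```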
WorldBench offers several distinctive advantages: equitable representation of all countries, reliable data from a reputable source (the World Bank), and a diverse selection of 11 indicators. The evaluation pipeline pairs a standardized prompting method with an automated parsing system, and models are compared using the absolute relative error between each model's answer and the World Bank's ground-truth value.
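To make the metric concrete, absolute relative error scores a model's numeric answer against the ground-truth value, normalized by that value (the standard definition). The snippet below sketches the computation along with a toy parser; the prompt wording and the number-extraction regex are hypothetical, not WorldBench's actual prompting or parsing code.

```python
import re

def absolute_relative_error(prediction: float, ground_truth: float) -> float:
    """|prediction - truth| / |truth|, the standard relative-error definition."""
    return abs(prediction - ground_truth) / abs(ground_truth)

def parse_numeric_answer(response: str) -> float | None:
    """Pull the last number out of a model response.

    Hypothetical parser; WorldBench's automated parsing is likely more robust.
    Taking the last match avoids grabbing years mentioned earlier in the reply.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

# Hypothetical prompt in the spirit of WorldBench's standardized prompting:
prompt = "What was the total population of Kenya in 2021? Answer with a single number."
model_response = "Kenya's population in 2021 was approximately 53,000,000."

prediction = parse_numeric_answer(model_response)
error = absolute_relative_error(prediction, ground_truth=53_005_614)
print(f"absolute relative error: {error:.4f}")
```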
The findings revealed significant geographical disparities across regions and income groups. North America and Europe & Central Asia saw the lowest error rates, while Sub-Saharan Africa had the highest. High-income countries consistently recorded lower error rates than low-income countries; indeed, average error rates roughly tripled from the high-income group to the low-income group. These geographical biases were consistent across all 20 evaluated LLMs and all 11 indicators.
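An aggregation like the one behind these findings can be reproduced with a small groupby over per-question scores. The column names and sample rows below are invented for illustration; the real WorldBench results cover roughly 200 countries, 11 indicators, and 20 models.

```python
import pandas as pd

# Hypothetical per-country error scores; not the paper's actual data.
results = pd.DataFrame(
    {
        "country": ["USA", "Germany", "Kenya", "Chad"],
        "region": [
            "North America",
            "Europe & Central Asia",
            "Sub-Saharan Africa",
            "Sub-Saharan Africa",
        ],
        "income_group": ["High", "High", "Lower-middle", "Low"],
        "abs_rel_error": [0.05, 0.07, 0.18, 0.24],
    }
)

# Mean absolute relative error per region and per income group.
print(results.groupby("region")["abs_rel_error"].mean())
print(results.groupby("income_group")["abs_rel_error"].mean())
```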
This study is important for exposing the geographic biases inherent in LLMs and may aid the development of future models that perform fairly across all regions and income levels. By building on World Bank data, WorldBench provides a flexible, continuously updatable framework for assessing these disparities, working toward more globally inclusive and fair language models. It marks a meaningful step toward addressing biases that can compromise the effectiveness and equitable application of AI technologies across global contexts.