Advances in Large Language Models (LLMs) have significantly improved information extraction (IE), the Natural Language Processing (NLP) task of identifying and extracting specific information from text. LLMs achieve strong results in IE, particularly when combined with instruction tuning: training the models to annotate text according to predefined guidelines, which improves their ability to generalize to new datasets.
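To make that setup concrete, here is a generic illustration of what a single instruction-tuned IE training example might look like: an annotation guideline, an input text, and the labeled spans the model is trained to produce. The guideline wording and output schema here are illustrative assumptions, not the exact format used by the models discussed below.

```python
# Illustrative instruction-tuning example for IE (NER); the guideline text and
# output schema are hypothetical, chosen only to show the general pattern.
example = {
    "instruction": (
        "Annotate the text with named entities. "
        "PERSON: names of people. LOCATION: cities, countries, regions."
    ),
    "input": "Marie Curie was born in Warsaw.",
    "output": [
        {"span": "Marie Curie", "label": "PERSON"},
        {"span": "Warsaw", "label": "LOCATION"},
    ],
}
```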
Despite these developments, LLMs struggle with low-resource languages, which lack both the unlabeled text needed for pre-training and the labeled data needed for fine-tuning. This scarcity of data makes it difficult for LLMs to achieve strong performance in these languages.
To address these issues, researchers from the Georgia Institute of Technology have proposed the TransFusion framework. In TransFusion, models are fine-tuned to work with data translated from low-resource languages into English, so that the original low-resource language text and its English translation together give the model the context it needs to generate more accurate predictions.
TransFusion aims to improve IE in low-resource languages by leveraging external Machine Translation (MT) systems. The framework involves three main steps: translating low-resource language data into English so that a high-resource model can annotate it; fusing the original low-resource language text with the annotated English translation in a specially trained model; and constructing a TransFusion Reasoning Chain that integrates annotation and fusion into a single autoregressive decoding sequence.
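The following is a minimal Python sketch of that three-step pipeline. The function names and the stubbed translation and annotation outputs are hypothetical placeholders standing in for the external MT system and the fine-tuned model; only the overall shape of the reasoning chain follows the description above.

```python
# Hypothetical sketch of the TransFusion reasoning chain; the helpers stand in
# for the external MT system and the fine-tuned annotation model.

def translate_to_english(text: str) -> str:
    """Step 1: translate low-resource language text into English with an
    external MT system (stubbed here with a fixed string)."""
    return "The new hospital opened in Kumasi last year."

def annotate_english(english_text: str) -> list[tuple[str, str]]:
    """Step 2: annotate the English translation with a model trained on
    high-resource English data (stubbed with a fixed NER output)."""
    return [("Kumasi", "LOCATION")]

def build_reasoning_chain(source_text: str, english_text: str,
                          english_spans: list[tuple[str, str]]) -> str:
    """Step 3: fuse the source text with the annotated translation into one
    sequence that the model decodes autoregressively, finishing with
    annotations projected onto the original text."""
    spans = "; ".join(f"{span} -> {label}" for span, label in english_spans)
    return (
        f"Source text: {source_text}\n"
        f"English translation: {english_text}\n"
        f"English annotations: {spans}\n"
        f"Source annotations:"  # the model completes this final line
    )

source = "<sentence in a low-resource language>"
english = translate_to_english(source)
chain = build_reasoning_chain(source, english, annotate_english(english))
print(chain)
```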
In addition to TransFusion, the research team developed GoLLIE-TF, a cross-lingual instruction-tuned LLM designed specifically for IE tasks. The model is intended to narrow the performance gap between high- and low-resource languages; together, TransFusion and GoLLIE-TF aim to make LLMs more effective at processing low-resource languages.
When evaluated on twelve multilingual IE datasets covering fifty languages, GoLLIE-TF performed well, demonstrating better zero-shot cross-lingual transfer than the base model: it applies what it learned during training to new languages without additional fine-tuning.
Furthermore, applying TransFusion to proprietary models such as GPT-4 substantially improved low-resource language named entity recognition (NER). TransFusion prompting raised GPT-4's performance by 5 F1 points, and fine-tuning different model types with the TransFusion framework yielded larger gains of 14 F1 points for decoder-only architectures and 13 F1 points for encoder-only architectures.
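For context on what those numbers measure, NER performance is typically reported as span-level F1, where a prediction counts only if both the span boundaries and the label exactly match the gold annotation. The snippet below is a generic illustration of that metric, not code from the paper.

```python
# Span-level F1 for NER: each span is (start, end, label), and a prediction is
# counted as correct only on an exact match with a gold span.

def ner_f1(gold: set[tuple[int, int, str]], pred: set[tuple[int, int, str]]) -> float:
    tp = len(gold & pred)                       # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PERSON"), (5, 6, "LOCATION")}
pred = {(0, 2, "PERSON"), (5, 6, "ORGANIZATION")}
print(round(ner_f1(gold, pred), 2))  # 0.5: one of the two spans matches exactly
```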
In conclusion, the combination of TransFusion and GoLLIE-TF offers a robust approach to improving IE in low-resource languages. By leveraging English translations and fine-tuning models to fuse annotations, it narrows the performance gap between high- and low-resource languages, with substantial improvements demonstrated across multiple models and datasets.