The rapid growth of digital text across languages and scripts presents significant challenges for natural language processing (NLP), particularly for transliterated data, on which performance often degrades. Pre-trained models such as XLM-R and Glot500 handle text in its original script well but struggle with transliterated versions, limiting their usefulness in cross-lingual tasks and their effectiveness in multilingual settings.
To tackle this challenge, researchers from the Center for Information and Language Processing at LMU Munich and the Munich Center for Machine Learning (MCML) have developed a new framework called TRANSMI. It enhances multilingual pre-trained language models (mPLMs) for transliterated data without any further training. TRANSMI modifies existing mPLMs through three merge modes (Min-Merge, Average-Merge, and Max-Merge) that integrate transliterated subwords into the mPLMs' vocabularies, addressing transliteration ambiguities and improving performance in cross-lingual tasks.
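The core mechanism can be pictured as vocabulary augmentation: transliterate each subword already in the model's vocabulary, group subwords that collide on the same romanization, and give the new entry an embedding derived from that group. The sketch below is a minimal illustration of this idea, assuming a toy transliteration table and an element-wise min/average/max reading of the three merge modes; it is not the authors' implementation.

```python
import numpy as np

# Hypothetical stand-in for a transliterator (e.g., a uroman-style romanizer);
# the toy table below is illustrative only.
def transliterate(subword: str) -> str:
    table = {"привет": "privet", "привёт": "privet", "мир": "mir"}
    return table.get(subword, subword)

def merge(vectors, mode: str) -> np.ndarray:
    # Combine the embeddings of all original subwords that share one
    # transliteration; the element-wise min/average/max reading of the
    # three merge modes is an assumption made for illustration.
    stacked = np.stack(vectors)
    if mode == "min":
        return stacked.min(axis=0)
    if mode == "avg":
        return stacked.mean(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown merge mode: {mode}")

def augment_vocab(vocab_embeddings: dict, mode: str = "max") -> dict:
    # Add transliterated subwords to a (subword -> embedding) table without
    # any further training: each new entry gets a merged embedding.
    groups: dict[str, list] = {}
    for subword, emb in vocab_embeddings.items():
        groups.setdefault(transliterate(subword), []).append(emb)

    augmented = dict(vocab_embeddings)
    for latin, embs in groups.items():
        if latin not in augmented:  # leave existing entries untouched
            augmented[latin] = merge(embs, mode)
    return augmented

# Toy example: two Cyrillic subwords collide on the same romanization.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["привет", "привёт", "мир"]}
new_vocab = augment_vocab(vocab, mode="max")
print(sorted(new_vocab))  # ['mir', 'privet', 'мир', 'привет', 'привёт']
```

Because the new entries reuse embeddings the model already has, no gradient updates are needed, which is what allows TRANSMI to skip further training.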
TRANSMI is particularly effective in Max-Merge mode for high-resource languages. It introduces new subwords tailored to transliterated data and adds them to the mPLMs' vocabularies. The framework has been tested across a wide range of scripts, including Cyrillic, Arabic, and Devanagari, and the TRANSMI-modified models consistently outperform their original versions on sentence retrieval, text classification, and sequence labeling, indicating a significant enhancement for multilingual NLP applications.
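One way to see why the added subwords matter: a romanized word that is absent from the original vocabulary gets shattered into many short pieces, while the augmented vocabulary can keep it intact. The toy longest-match segmenter below (a simplified stand-in, not TRANSMI's actual tokenizer) makes that concrete.

```python
def greedy_tokenize(text: str, vocab: set) -> list:
    # Toy longest-match segmenter, standing in for a real subword tokenizer.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:  # fall back to single characters
                tokens.append(piece)
                i = j
                break
    return tokens

original = {"привет", "мир"}               # original-script vocabulary only
augmented = original | {"privet", "mir"}   # after TRANSMI-style augmentation

print(greedy_tokenize("privet", original))   # ['p', 'r', 'i', 'v', 'e', 't'] (fragmented)
print(greedy_tokenize("privet", augmented))  # ['privet'] (kept as one subword)
```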
Moreover, the validation datasets span a variety of scripts. For example, the FURINA model, when modified with Max-Merge, showed significant improvements on sequence labeling, demonstrating TRANSMI's ability to handle different phonetic scripts and to reduce errors arising from transliteration ambiguities.
TRANSMI-modified models achieve higher accuracy than their unmodified equivalents. With FURINA, for instance, Max-Merge improved sequence labeling across a range of languages and scripts. These gains highlight the framework's potential as an effective tool for strengthening multilingual NLP models, ensuring better handling of transliterated data and more accurate cross-lingual processing.
In conclusion, TRANSMI offers a practical way to improve mPLMs' performance on transliterated data. By modifying existing models rather than requiring additional training, the framework substantially enhances mPLMs' ability to process transliterations, yielding clear gains on cross-lingual tasks and paving the way for further advances in multilingual NLP, global communication, and information processing.