Subword tokenizers in natural language processing (NLP) pair a vocabulary, built by construction algorithms such as BPE, WordPiece, and UnigramLM, with an inference method that segments new text using that vocabulary, so understanding the differences between inference methods is essential. The choice of inference method in an implementation has a significant impact on the tokenizer's compatibility and effectiveness. However, it is often unclear how well inference methods match the vocabularies they are paired with, and whether this compatibility is necessary or beneficial.
Previous research focused primarily on developing vocabulary construction algorithms, optimizing vocabulary size, and building multilingual vocabularies; the inference methods themselves were rarely examined. There is a clear need for a comprehensive evaluation of inference methods across different vocabularies.
Addressing this gap, researchers from Ben-Gurion University of the Negev, Beer Sheva, and the Massachusetts Institute of Technology conducted a detailed experiment. They evaluated seven tokenizer inference methods across four vocabulary construction algorithms and three vocabulary sizes. This comprehensive evaluation used an intrinsic benchmark suite combining measures from morphology, cognition, and information theory, specifically for English.
Across the most commonly used tokenizers, the study found that greedy inference typically performed quite well. Greedy inference considers and produces only a single token at each step, and comes in three primary variants: longest prefix, which repeatedly takes the longest vocabulary token that begins the remaining text; longest suffix, its mirror image working from the end of the word; and longest token, which selects the longest vocabulary token found anywhere in the word and then recurses on the remainders.
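As a concrete illustration, the following is a minimal Python sketch of the three greedy variants over a toy vocabulary (the vocabulary and function names are illustrative, not the paper's implementation):

```python
def longest_prefix(word: str, vocab: set[str]) -> list[str]:
    """Repeatedly take the longest vocabulary token that starts the remaining text."""
    tokens = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                tokens.append(word[:end])
                word = word[end:]
                break
        else:  # no vocabulary match: fall back to a single character
            tokens.append(word[0])
            word = word[1:]
    return tokens


def longest_suffix(word: str, vocab: set[str]) -> list[str]:
    """Mirror image: repeatedly take the longest vocabulary token that ends the text."""
    tokens = []
    while word:
        for start in range(len(word)):
            if word[start:] in vocab:
                tokens.insert(0, word[start:])
                word = word[:start]
                break
        else:
            tokens.insert(0, word[-1])
            word = word[:-1]
    return tokens


def longest_token(word: str, vocab: set[str]) -> list[str]:
    """Take the longest vocabulary token anywhere in the word, recurse on the remainders."""
    if not word:
        return []
    for length in range(len(word), 0, -1):
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in vocab:
                return (longest_token(word[:start], vocab) + [piece]
                        + longest_token(word[start + length:], vocab))
    return list(word)  # no vocabulary token found: fall back to characters


vocab = {"un", "like", "ly", "likely", "unlike"}
print(longest_prefix("unlikely", vocab))  # ['unlike', 'ly']
print(longest_suffix("unlikely", vocab))  # ['un', 'likely']
print(longest_token("unlikely", vocab))   # ['unlike', 'ly']
```

Even on this tiny vocabulary the variants can disagree: longest suffix splits off "un" first, while the other two find "unlike".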
The researchers' extensive study found that each algorithm's performance varied with the inference method applied to it. Inference methods based on merge rules often outperformed the default strategies, particularly in terms of morphological alignment. Likelihood-based methods, meanwhile, at times assigned high likelihood values to frequently used tokens, which could degrade overall segmentation quality.
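To make these two families concrete, here is a minimal, self-contained sketch (the toy merge list and log-probabilities below are invented for illustration and are not from the paper): merge-rule inference rebuilds a word from characters by replaying BPE-style merges in their training order, while likelihood-based inference, in the spirit of UnigramLM, uses a Viterbi search for the segmentation with the highest summed token log-probability.

```python
import math


def merge_inference(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Merge-rule inference: start from characters, apply learned merges in rank order."""
    tokens = list(word)
    for left, right in merges:  # earliest-learned merge first
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the pair, re-check this position
            else:
                i += 1
    return tokens


def viterbi_inference(word: str, logprob: dict[str, float]) -> list[str]:
    """Likelihood-based inference: maximize the sum of token log-probabilities."""
    n = len(word)
    best = [(-math.inf, -1)] * (n + 1)  # best[i] = (score, backpointer) for word[:i]
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            score = best[start][0] + logprob.get(word[start:end], -math.inf)
            if score > best[end][0]:
                best[end] = (score, start)
    tokens, pos = [], n
    while pos > 0:  # assumes the vocabulary can segment the whole word
        start = best[pos][1]
        tokens.insert(0, word[start:pos])
        pos = start
    return tokens


merges = [("l", "y"), ("u", "n"), ("l", "i"), ("li", "k"), ("lik", "e")]
print(merge_inference("unlikely", merges))  # ['un', 'like', 'ly']

# A very frequent token ("in", with high log-probability) pulls the segmentation
# away from the morphological split do+ing, illustrating the effect noted above.
logprob = {"do": -3.0, "ing": -4.0, "in": -1.0, "g": -2.0, "doing": -8.0}
print(viterbi_inference("doing", logprob))  # ['do', 'in', 'g']
```

In a real tokenizer the merge ranks and token probabilities come from training; the point here is only that merge order drives the first segmentation, while raw token likelihood drives the second.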
Among the evaluated algorithms, SaGe, which builds a contextually informed vocabulary, demonstrated superior alignment with morphology. BPE and WordPiece performed well on compression but lagged on cognitive benchmarks. Likelihood-based and information-based vocabularies showed consistent trends within their categories, supporting the robustness of the benchmark.
The study underscores the significance of picking the appropriate inference method for a given vocabulary and task, and emphasizes its practical effect on computational efficiency. Such selection can assist the training of language models by refining tokenization schemes. In this context, greedy inference's surprisingly strong performance, particularly on morphologically driven tasks, is notable: it holds even for tokenizers trained with different objectives, making it a method of choice. The researchers hope that their aggregated benchmark for evaluating subword tokenizers will contribute to setting new standards in the field.