“Text mining” refers to the discovery of new patterns and insights within large amounts of textual data. Two essential activities in text mining are the creation of a taxonomy – a collection of structured, canonical labels that characterize features of a corpus – and text classification, which assigns labels to instances within the corpus according to this taxonomy. This process has significant practical applications, particularly in scenarios involving vast, unexplored corpora or undefined label spaces.
A traditional approach involves developing a label taxonomy with the help of domain experts and training a machine learning model for text classification on human annotations of a portion of the corpus. While these methods offer interpretability, they do not scale easily, demanding substantial time, cost, and domain knowledge. They are also prone to errors and biases and must be repeated from scratch for each new use case.
To address these challenges, machine learning methods such as text clustering, topic modelling, and phrase mining aim to improve scalability by deriving the label taxonomy from learned clusters. Although this approach scales better to larger corpora and varied use cases, it has been likened to “reading tea leaves” because the resulting clusters are difficult to define consistently and interpret.
In response to these issues, researchers from Microsoft Corporation and the University of Washington have developed TnT-LLM, a new framework combining the interpretability of manual methods with the scalability of automated topic modelling and text clustering. This two-stage framework uses large language models (LLMs) first to generate a label taxonomy and then to produce training labels for text classification.
In the first stage, a zero-shot, multi-stage reasoning procedure tasks an LLM with proposing and iteratively refining a label taxonomy for a specific use case based on the textual corpus. In the second stage, LLMs act as data labellers, generating training data at scale during the text classification phase so that classifiers capable of large-scale labelling can be trained.
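The two stages can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: the `llm` function is a stand-in for a real LLM call (in practice a chat-completion API), stubbed with keyword rules so the example runs deterministically, and the fixed taxonomy and keyword-profile classifier merely approximate the taxonomy-generation loop and the lightweight classifier the paper describes.

```python
# Hypothetical sketch of the two TnT-LLM stages (not the paper's code).
from collections import Counter

def llm(prompt: str) -> str:
    """Stand-in for an LLM call: labels a snippet by simple keyword rules."""
    text = prompt.lower()
    if "refund" in text or "charge" in text:
        return "Billing"
    if "login" in text or "password" in text:
        return "Account access"
    return "Other"

# Stage 1 (taxonomy generation): TnT-LLM has an LLM propose and iteratively
# refine a taxonomy over samples of the corpus. We approximate the outcome
# with a fixed label set for illustration.
taxonomy = ["Billing", "Account access", "Other"]

# Stage 2 (LLM-augmented classification): the LLM pseudo-labels a corpus
# sample, and those labels become training data for a lightweight classifier.
corpus = [
    "I was charged twice, please refund me",
    "Cannot reset my password",
    "My login keeps failing",
    "How do I export my data?",
]
pseudo_labels = [llm(doc) for doc in corpus]

# A toy "classifier": per-label keyword profiles built from the
# pseudo-labelled data (a real pipeline would train e.g. a logistic
# regression over text embeddings instead).
profiles: dict[str, Counter] = {}
for doc, label in zip(corpus, pseudo_labels):
    profiles.setdefault(label, Counter()).update(doc.lower().split())

def classify(doc: str) -> str:
    words = doc.lower().split()
    return max(profiles, key=lambda lbl: sum(profiles[lbl][w] for w in words))

print(pseudo_labels)   # labels produced by the stubbed LLM
print(classify("please refund this charge"))
```

The design point is the division of labour: the expensive LLM is called only on a sample to build the taxonomy and label training data, while the cheap classifier handles the full corpus.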
To validate their methodology, the team has provided a range of quantitative and traceable evaluation methods, including deterministic automatic metrics, human evaluation metrics, and LLM-based evaluations. They utilised Bing Copilot, a web-scale, multilingual, open-domain conversational agent, analysing its conversations with TnT-LLM. The results revealed that the proposed framework generates label taxonomies with higher accuracy and relevance than existing text clustering techniques. Ultimately, the team plans to investigate hybrid approaches that combine LLMs with embedding-based methods to improve the framework’s speed, efficiency, and resilience, and to further examine LLM-assisted evaluation. While the work so far has focused on conversational text mining, the researchers also aim to apply the methodology to other domains.
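One common deterministic metric for this kind of validation is chance-corrected agreement between LLM-assigned and human-assigned labels. The sketch below (an illustration, not taken from the paper; the label sequences are invented) computes Cohen's kappa in plain Python:

```python
# Illustrative agreement metric: Cohen's kappa between two label sequences,
# e.g. LLM-assigned vs. human-assigned labels on the same instances.
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

llm_labels   = ["Billing", "Billing", "Other", "Account", "Other"]
human_labels = ["Billing", "Other",   "Other", "Account", "Other"]
print(round(cohens_kappa(llm_labels, human_labels), 3))  # prints 0.688
```

A kappa near 1 indicates the LLM labels closely track human judgement; a value near 0 means agreement is no better than chance.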