The recent rise of prominent transformer-based language models (LMs) has underscored the need for research into how they work. Understanding these mechanisms is essential for ensuring the safety and fairness of advanced AI systems and for reducing their biases and errors, particularly in critical contexts. As a result, research within the Natural Language Processing (NLP) community has increasingly focused on interpretability in language models, seeking more robust insights into how these models operate.
Past surveys have detailed a variety of techniques used in Explainable AI analyses and their applications within NLP. Earlier assessments primarily focused on encoder-based models such as BERT. However, the advent of decoder-only Transformers has spurred new work on examining these powerful generative models. Concurrently, research has explored trends in interpretability and their connection to AI safety, highlighting how interpretability research in the NLP domain continues to evolve.
Researchers from Universitat Politècnica de Catalunya, the CLCG at the University of Groningen, and FAIR (Meta) have conducted a study that offers an in-depth technical overview of techniques used in LM interpretability research. The methods discussed are categorized along two dimensions: those that localize the inputs or model components responsible for a prediction, and those that decode the information encoded in learned representations. Crucially, the study also compiles an extensive list of insights into the workings of Transformer-based LMs and provides a guide to useful tools for conducting interpretability analyses on these models.
The research puts forward two types of methods for localizing model behavior: input attribution and model component attribution. Both have already yielded valuable insights into how language models work. Probing trains supervised models to predict properties of the input from intermediate representations, whereas methods such as sparse autoencoders disentangle the features a model has learned into more interpretable representations. The authors also describe several open-source software libraries, such as Captum, that facilitate interpretability studies on Transformer-based LMs.
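To make the input-attribution idea concrete, the following is a minimal sketch of attributing a model's next-token prediction to its input tokens with Captum's LayerIntegratedGradients. The GPT-2 checkpoint, the EOS-token baseline, and the choice of the model's own top prediction as the attribution target are illustrative assumptions, not prescriptions from the survey.

```python
# Minimal input-attribution sketch with Captum (illustrative assumptions noted above).
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The capital of France is"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Use the model's own top next-token prediction as the attribution target.
with torch.no_grad():
    target_id = model(input_ids).logits[0, -1].argmax().item()

def next_token_logit(ids):
    # Scalar score per example: the logit of the chosen next token.
    return model(ids).logits[:, -1, target_id]

# Attribute that logit to the token-embedding layer, integrating along a path
# from a neutral baseline (here: the EOS token) to the actual input.
lig = LayerIntegratedGradients(next_token_logit, model.transformer.wte)
baseline_ids = torch.full_like(input_ids, tokenizer.eos_token_id)
attributions = lig.attribute(input_ids, baselines=baseline_ids, n_steps=32)

# Collapse the hidden dimension to get one relevance score per input token.
token_scores = attributions.sum(dim=-1).squeeze(0)
for tok, score in zip(tokenizer.convert_ids_to_tokens(input_ids[0]), token_scores):
    print(f"{tok:>12}  {score.item():+.4f}")
```

The printed scores indicate how much each input token contributed, positively or negatively, to the predicted next token; component-attribution methods apply the same logic to internal parts of the model, such as attention heads or MLP layers, rather than to the input.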
In summary, this thorough study emphasizes the need to understand the inner workings of Transformer-based language models in order to ensure their safety and fairness and to minimize bias. The research contributes significantly to the growing field of AI interpretability by surveying interpretability techniques and the insights gained from model analyses. The study's categorization of interpretability methods sharpens understanding in the field and supports ongoing efforts to improve model transparency and interpretability.