Understanding the decision-making processes of Large Language Models (LLMs) becomes paramount as they are deployed in high-stakes applications, yet the opacity of these models makes this difficult, which is why it has become a central subject of interpretability research. Because artificial neural networks are observable and deterministic, researchers can probe them in detail; a thorough understanding of these models both advances our knowledge and helps in building AI systems that cause minimal harm.
New research from MIT and the University of Cambridge investigates the universality of individual neurons, specifically in GPT2 language models. The study aims to identify and examine neurons that are universal across models trained from different random initializations. How widely universal features are spread across a model has significant implications for developing automatic methods to understand and monitor neural circuits.
The study focuses on transformer-based language models, replicating experiments across the GPT2 series and the Pythia family. Universality is measured via activation correlations: whether pairs of neurons in differently initialized models are consistently activated by the same inputs. The researchers hypothesize that, although individual neurons are typically polysemantic, universal neurons may carry a more singular, interpretable meaning.
Applying this activation-correlation criterion, the results challenge the idea that universality holds for most neurons: only 1-5% of neurons meet the universality threshold.
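To make the criterion concrete, here is a minimal sketch of how cross-model activation correlations and a universality cutoff could be computed. This is an illustration under simplifying assumptions, not the authors' actual code: the synthetic activations, the use of the best absolute match, and the 0.9 threshold are all assumptions for demonstration, not the paper's exact values.

```python
import numpy as np

def neuron_correlations(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Pearson correlation between every neuron in model A and every neuron in model B.

    acts_a: (n_tokens, n_neurons_a) activations of model A on a shared token stream.
    acts_b: (n_tokens, n_neurons_b) activations of model B on the same tokens.
    Returns an (n_neurons_a, n_neurons_b) correlation matrix.
    """
    a = (acts_a - acts_a.mean(axis=0)) / (acts_a.std(axis=0) + 1e-8)
    b = (acts_b - acts_b.mean(axis=0)) / (acts_b.std(axis=0) + 1e-8)
    return (a.T @ b) / acts_a.shape[0]

# Toy demonstration: two "models" with 512 neurons each, recorded on 10k shared
# tokens, where the first 20 neurons of B are noisy copies of A's first 20.
rng = np.random.default_rng(0)
acts_a = rng.standard_normal((10_000, 512))
acts_b = rng.standard_normal((10_000, 512))
acts_b[:, :20] = acts_a[:, :20] + 0.1 * rng.standard_normal((10_000, 20))

corr = neuron_correlations(acts_a, acts_b)
best_match = np.abs(corr).max(axis=1)  # best cross-model correlation per neuron in A
THRESHOLD = 0.9                        # illustrative cutoff, not the paper's exact value
universal = np.flatnonzero(best_match > THRESHOLD)
print(f"{universal.size} of {acts_a.shape[1]} neurons "
      f"({100 * universal.size / acts_a.shape[1]:.1f}%) pass the threshold")
```

On this toy data, the 20 planted neurons are recovered, landing in the same 1-5% range the study reports for real models.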
Further analysis shows that universal neurons have distinctive statistical properties in their weights and activations compared to non-universal ones. These neurons frequently take actions that shape the model's downstream computation, such as modulating its output distribution, rather than merely extracting input features.
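The kinds of statistics involved can be sketched as follows. The specific quantities below (activation skew and kurtosis, output weight norm, input-output weight alignment) are examples of plausible per-neuron summaries, chosen for illustration rather than taken verbatim from the paper.

```python
import numpy as np

def neuron_stats(acts: np.ndarray, w_in: np.ndarray, w_out: np.ndarray) -> dict:
    """Per-neuron summary statistics of the kind one might compare between
    universal and non-universal neurons.

    acts:  (n_tokens, n_neurons) pre-activations on a reference corpus.
    w_in:  (n_neurons, d_model) input weights (how each neuron reads the residual stream).
    w_out: (n_neurons, d_model) output weights (how each neuron writes back to it).
    """
    z = (acts - acts.mean(axis=0)) / (acts.std(axis=0) + 1e-8)
    unit_in = w_in / np.linalg.norm(w_in, axis=1, keepdims=True)
    unit_out = w_out / np.linalg.norm(w_out, axis=1, keepdims=True)
    return {
        "act_skew": (z**3).mean(axis=0),      # asymmetry of the activation distribution
        "act_kurtosis": (z**4).mean(axis=0),  # heavy tails indicate rare, selective firing
        "out_weight_norm": np.linalg.norm(w_out, axis=1),
        "in_out_cos": (unit_in * unit_out).sum(axis=1),  # alignment of read/write directions
    }

# Hypothetical usage, reusing `universal` from the previous sketch:
# stats = neuron_stats(acts, w_in, w_out)
# print(np.median(stats["act_kurtosis"][universal]), np.median(stats["act_kurtosis"]))
```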
The research concludes that only a small fraction of neurons is universal. However, these universal neurons often form antipodal pairs, with weights pointing in nearly opposite directions, suggesting an ensemble-like redundancy that could improve robustness and calibration.
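As a rough illustration of what "pairing up opposingly" means, antipodal partners can be detected by looking for pairs of output weight vectors with cosine similarity close to -1. The cutoff of -0.95 below is an assumed value for demonstration, not one taken from the paper.

```python
import numpy as np

def antipodal_pairs(w_out: np.ndarray, cos_cutoff: float = -0.95):
    """Find neuron pairs whose output weight vectors point in nearly
    opposite directions within a layer.

    w_out: (n_neurons, d_model) output weights, one row per neuron.
    Returns (i, j) index pairs with cosine similarity below cos_cutoff.
    """
    unit = w_out / np.linalg.norm(w_out, axis=1, keepdims=True)
    cos = unit @ unit.T
    i, j = np.where(np.triu(cos < cos_cutoff, k=1))  # upper triangle avoids duplicates
    return list(zip(i.tolist(), j.tolist()))

# Toy check: plant one antipodal pair among otherwise random neurons.
rng = np.random.default_rng(1)
w = rng.standard_normal((64, 768))
w[1] = -w[0] + 0.01 * rng.standard_normal(768)  # neuron 1 nearly opposes neuron 0
print(antipodal_pairs(w))  # -> [(0, 1)]
```

Because random high-dimensional vectors are almost never this anti-aligned, any pair that clears such a cutoff is a strong candidate for a deliberate opposing pair.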
Acknowledged gaps in the research include its concentration on small models and its deliberately restrictive definition of universality. Directions for future work include repeating the analysis in an overcomplete dictionary basis, scaling to larger models, and automating interpretation using LLMs. Such advances could yield a deeper understanding of language models. All credit for this research goes to its respective researchers.
The complete research paper and GitHub link are available via MarkTechPost.