Language models have become increasingly complex, making it challenging to interpret their inner workings. To address this, research has turned to mechanistic interpretability, which focuses on identifying and analyzing ‘circuits’: sparse computational subgraphs that capture specific aspects of a model’s behavior.
The existing methods for discovering these circuits, namely Automated Circuit Discovery (ACDC) and Edge Attribution Patching (EAP), come with significant limitations.
ACDC relies on an inefficient greedy search that is computationally expensive and does not scale to large datasets or billion-parameter models. EAP, by contrast, trades faithfulness to the full model for speed by relying on gradient-based linear approximations; while much faster, this undermines the goal of mechanistic interpretability and limits what can be learned about the intricate mechanisms of complex language models.
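To make the trade-off concrete, the sketch below illustrates the kind of gradient-based linear (first-order) approximation that attribution-patching methods like EAP rely on to score edges. It is a minimal illustration, not the authors’ implementation: the function name, the dictionaries of activations and gradients, and their shapes are all assumed placeholders.

```python
import torch

def eap_edge_scores(acts_clean, acts_corrupt, grads_clean):
    """
    First-order approximation of the effect of corrupting each edge:
        score ≈ (a_corrupt - a_clean) · dL/da   (gradient taken on the clean run)
    All three dicts map an edge name to a tensor of shape [batch, d_model];
    names and shapes here are illustrative placeholders.
    """
    scores = {}
    for edge, a_clean in acts_clean.items():
        delta = acts_corrupt[edge] - a_clean          # activation difference
        effect = (delta * grads_clean[edge]).sum(-1)  # dot product per example
        scores[edge] = effect.mean().abs().item()     # average magnitude over the batch
    return scores

# Edges with the lowest scores are discarded. Because the estimate is linear,
# the ranking can drift from the true, nonlinear effect of ablating an edge,
# which is the source of the faithfulness gap described above.
```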
To overcome these problems, researchers from Princeton Language and Intelligence (PLI) at Princeton University have proposed a method called Edge Pruning. Unlike existing approaches, Edge Pruning frames circuit discovery in language models as an optimization problem and solves it with gradient-based pruning. It adapts pruning techniques to discover circuits rather than to compress models, and it prunes the edges between components rather than the components themselves.
Edge Pruning replaces the Transformer’s conventional residual stream with a disentangled version, which makes it possible to introduce edge masks that determine which upstream components each component reads from. These edge masks are then optimized with discrete optimization techniques such as L0 regularization to produce sparse circuits. In this way, Edge Pruning balances efficiency, scalability, and faithfulness to the full model, avoiding the pitfalls of previous approaches.
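The following minimal PyTorch sketch conveys the core idea under simplifying assumptions; it is not the authors’ code. The paper uses a hard-concrete relaxation for the L0 objective, whereas this sketch uses a plain sigmoid gate per edge and its expected value as a sparsity surrogate to keep things short. The class name, tensor shapes, and the training-loop comment are all hypothetical.

```python
import torch
import torch.nn as nn

class MaskedEdgeReader(nn.Module):
    """
    Sketch of a disentangled residual stream: the outputs of all upstream
    components are kept separate, and a learnable gate on each edge decides
    how much each upstream output contributes to a downstream component's input.
    """
    def __init__(self, num_upstream: int):
        super().__init__()
        # One logit per incoming edge; sigmoid(logit) acts as a soft keep/drop gate.
        self.edge_logits = nn.Parameter(torch.zeros(num_upstream))

    def forward(self, upstream_outputs: torch.Tensor) -> torch.Tensor:
        # upstream_outputs: [num_upstream, batch, seq, d_model]
        gates = torch.sigmoid(self.edge_logits)  # soft edge mask in [0, 1]
        return (gates[:, None, None, None] * upstream_outputs).sum(dim=0)

    def sparsity_penalty(self) -> torch.Tensor:
        # Surrogate for the L0 norm: the expected number of open edges.
        return torch.sigmoid(self.edge_logits).sum()

# Training-loop sketch: a faithfulness loss that matches the full model's outputs,
# plus a weighted sparsity penalty, both optimized by gradient descent:
# loss = faithfulness_loss(circuit_logits, full_model_logits) + lam * reader.sparsity_penalty()
```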
Edge Pruning has outperformed methods such as ACDC and EAP, particularly on complex tasks. Tests on four standard circuit-finding tasks showed that Edge Pruning finds circuits in GPT-2 Small that are more faithful to the full model. Its advantage was most pronounced on complex tasks such as multi-template Indirect Object Identification (IOI), where it found circuits with 2.65 times fewer edges while maintaining faithfulness to the model’s outputs.
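Faithfulness in this line of work is commonly quantified by how closely the circuit’s output distribution matches the full model’s, for example with a KL divergence over the output logits. Below is a minimal sketch of such a metric, assuming both sets of logits are already available; the function name and tensor layout are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def faithfulness_kl(circuit_logits: torch.Tensor, full_logits: torch.Tensor) -> torch.Tensor:
    """KL(full model || circuit), averaged over the batch; lower means more faithful."""
    log_p_circuit = F.log_softmax(circuit_logits, dim=-1)
    log_p_full = F.log_softmax(full_logits, dim=-1)
    return F.kl_div(log_p_circuit, log_p_full, log_target=True, reduction="batchmean")
```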
Furthermore, Edge Pruning scales to larger datasets, outpacing the other methods in both speed and performance on a 100K-example version of IOI. It also perfectly recovered the ground-truth circuits in two Transformers compiled with Tracr, further validating its accuracy.
Although Edge Pruning shows promise for advancing mechanistic interpretability, challenges remain, such as its memory requirements and the need for further automation in interpreting the discovered circuits. Nevertheless, Edge Pruning represents a significant step towards understanding and explaining large foundation models, and thereby towards their safe development and deployment.