Researchers at Anthropic have identified millions of concepts represented inside a production large language model (LLM), Claude 3 Sonnet. AI models are often likened to a ‘black box’ because their internal workings are hard to inspect: a single concept is typically spread across many neurons, and a single neuron takes part in many concepts. This makes identifying individual concepts challenging, a problem Anthropic addressed using a technique called “dictionary learning.”
Dictionary learning looks for recurring patterns of neuron activations, called features, so that any internal state of the model can be described by a few active features rather than by many active neurons. Anthropic applied the technique to activations taken from a middle layer of Claude 3 Sonnet and extracted millions of features, ranging from concrete entities such as cities and people to more abstract notions like scientific disciplines and programming syntax.
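The idea of describing an activation with a few dictionary features can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Anthropic's implementation: the encoder weights, bias values, three-entry dictionary, and the names `encode`, `decode`, and `sae_loss` are all hypothetical, and the weights are fixed by hand rather than learned. It shows only the forward pass and the kind of objective (reconstruction error plus a sparsity penalty) that dictionary-learning methods commonly minimize.

```python
def relu(x):
    return max(0.0, x)

def encode(activation, enc_weights, enc_bias):
    """Map a dense activation vector to non-negative feature activations."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(enc_weights, enc_bias)
    ]

def decode(features, dictionary):
    """Rebuild the activation as a weighted sum of dictionary directions."""
    dim = len(dictionary[0])
    return [
        sum(f * direction[i] for f, direction in zip(features, dictionary))
        for i in range(dim)
    ]

def sae_loss(activation, features, reconstruction, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    # Features are non-negative after ReLU, so their sum is their L1 norm.
    return mse + l1_coeff * sum(features)

# Hypothetical dictionary: three feature directions in a 2-D activation space.
dictionary = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
enc_weights = dictionary          # assumption: encoder tied to the dictionary
enc_bias = [-1.5, -1.5, -1.5]     # negative bias keeps weakly matching features off

activation = [2.0, 0.0]
features = encode(activation, enc_weights, enc_bias)
reconstruction = decode(features, dictionary)
loss = sae_loss(activation, features, reconstruction, l1_coeff=0.1)
```

In a real setting the dictionary would have far more entries than the activation has dimensions, and the weights would be trained so that the loss is small across many activations while only a handful of features are active for any one input.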
To better understand how the model organizes concepts, the researchers also compared features with one another based on their activation patterns. This analysis revealed that related concepts tend to cluster together within the model. To verify that the features mean what they appear to mean, the research team conducted “feature steering” experiments: they artificially increased or suppressed the activation of specific features and observed how the AI’s responses changed. The results provide evidence of a causal link between individual features and model behavior.
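Both ideas in this paragraph, comparing features and steering with them, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the feature names, the three-dimensional direction vectors, the use of cosine similarity as the closeness measure, and the choice to steer by adding a scaled feature direction to an activation. The actual measure and intervention used in the research may differ.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_features(name, directions, k=2):
    """Return the k features whose directions are most similar to `name`."""
    target = directions[name]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in directions.items()
        if other != name
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def steer(activation, direction, strength):
    """Push an activation along a feature direction (negative strength suppresses)."""
    return [a + strength * d for a, d in zip(activation, direction)]

# Hypothetical feature directions; the names are placeholders, not real features.
feature_directions = {
    "city_a": [1.0, 0.1, 0.0],
    "city_b": [0.9, 0.2, 0.0],
    "python_syntax": [0.0, 0.1, 1.0],
}

neighbors = nearest_features("city_a", feature_directions, k=1)
steered = steer([0.2, 0.0, 0.1], feature_directions["python_syntax"], strength=5.0)
```

In this toy data the two city features point in nearly the same direction, so each is the other's nearest neighbor, which mirrors the clustering of related concepts. In a steering experiment the modified activation would be fed back into the model's remaining layers to see how the output changes; that step is omitted here because it requires access to the model.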
The study also argues that interpretability is critical for AI safety. A better understanding of how a model produces its behavior could help address potential risks and improve transparency. For instance, these insights could help predict and mitigate biases and other undesirable or unpredictable behaviors.
Anthropic’s research makes significant progress towards understanding the internal workings of LLMs. However, a complete understanding of these models remains out of reach because of their immense complexity. The features found so far are only a subset of the concepts the model has learned, and by Anthropic’s own account, finding a full set with the current approach would require far more computation than was used to train the model in the first place. Nevertheless, this work is a promising step towards decoding the AI ‘black box.’