Sparse autoencoders (SAEs) are neural networks that learn data representations under a sparsity constraint, so that each input is captured by only a handful of its most essential characteristics. This constraint yields compact representations and can improve generalization to unseen data.
Language model (LM) activations can be approximated with SAEs, which sparsely decompose each activation into a linear combination of directions drawn from a large dictionary of fundamental “feature” directions. A decomposition is considered good if it is sparse, meaning few dictionary elements are needed to reconstruct any given activation, and faithful, meaning the error between the original activation and its SAE reconstruction is small.
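To make this concrete, below is a minimal sketch of a standard ReLU-based SAE in PyTorch, with a reconstruction term for faithfulness and an L1 penalty as one common way of encouraging sparsity. The layer sizes, parameter names, and sparsity coefficient are illustrative choices, not details taken from the DeepMind paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE sketch: encode to sparse features, decode back to the activation."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # The dictionary is typically much larger than the activation dimension.
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Feature activations: non-negative coefficients over the dictionary.
        return torch.relu(x @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # Reconstruction as a linear combination of dictionary directions (rows of W_dec).
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f

def sae_loss(x, x_hat, f, sparsity_coef: float = 1e-3):
    # "Faithful": small reconstruction error. "Sparse": few active features,
    # encouraged here with an L1 penalty (one common choice; the JumpReLU
    # paper instead penalises L0 directly).
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + sparsity_coef * sparsity
```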
Recently, Google DeepMind researchers altered this design by introducing JumpReLU SAEs. JumpReLU SAEs replace the standard ReLU with a JumpReLU activation function, which zeroes out any pre-activation that falls below a learnable positive threshold. This reduces the number of active features per input while improving the model's generalization, opening up new opportunities in SAE design.
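The JumpReLU activation itself is simple to state: a pre-activation passes through unchanged if it exceeds a positive (typically per-feature, learnable) threshold, and is zeroed otherwise. The sketch below assumes PyTorch tensors and a per-feature threshold, and shows only the forward computation; training the threshold is the subject of the next point.

```python
import torch

def jump_relu(pre_acts: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    # JumpReLU: keep a pre-activation only if it exceeds the positive threshold,
    # i.e. z * H(z - theta) with H the Heaviside step function.
    return pre_acts * (pre_acts > threshold).to(pre_acts.dtype)
```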
This adjustment, however, introduces a discontinuity: the threshold receives no useful gradient signal through the training loss. The researchers addressed this by using straight-through estimators to estimate the gradient of the expected loss with respect to the threshold, allowing training to proceed with standard gradient-based methods.
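The sketch below illustrates one way such a straight-through estimator can be wired up in PyTorch: the forward pass computes the exact, discontinuous JumpReLU, while the backward pass substitutes a kernel-based pseudo-gradient for the threshold within a small window of width `eps`. The rectangle-kernel choice, the `eps` bandwidth, and the batch-by-feature shapes are assumptions made for illustration and may differ from the paper's exact implementation.

```python
import torch

class JumpReLUSTE(torch.autograd.Function):
    """JumpReLU whose threshold receives a straight-through pseudo-gradient (sketch)."""

    @staticmethod
    def forward(ctx, pre_acts, threshold, eps=1e-3):
        # Exact, discontinuous JumpReLU forward pass.
        ctx.save_for_backward(pre_acts, threshold)
        ctx.eps = eps
        return pre_acts * (pre_acts > threshold).to(pre_acts.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        pre_acts, threshold = ctx.saved_tensors
        eps = ctx.eps
        active = (pre_acts > threshold).to(pre_acts.dtype)
        # Ordinary gradient w.r.t. the pre-activations (treat the step as constant).
        grad_pre = grad_out * active
        # Pseudo-gradient w.r.t. the threshold: non-zero only for pre-activations
        # within eps/2 of the threshold (a rectangle kernel), scaled by -threshold/eps.
        in_window = ((pre_acts - threshold).abs() < eps / 2).to(pre_acts.dtype)
        grad_threshold = grad_out * (-threshold / eps) * in_window
        # Assumes pre_acts has shape (batch, d_dict) and threshold (d_dict,):
        # sum over the batch so the gradient matches the per-feature threshold.
        grad_threshold = grad_threshold.sum(dim=0)
        return grad_pre, grad_threshold, None
```

In use, `acts = JumpReLUSTE.apply(pre_acts, threshold)` would replace the plain JumpReLU call inside the encoder, letting both the encoder weights and the thresholds be updated by a standard optimizer.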
When benchmarked against Gated and TopK SAEs on a range of LM activations, JumpReLU SAEs outperformed Gated SAEs in reconstruction faithfulness and were more efficient to train than TopK SAEs.
Although JumpReLU features were judged somewhat less interpretable than those of Gated and TopK SAEs, their interpretability was found to improve as the SAE was made sparser.
The study trained and evaluated SAEs on a single model, albeit across many sites and layers, so it remains unclear how well these results transfer to models with different architectures or training details.
The study concludes on an optimistic note, suggesting that future work could further modify the loss function used to train JumpReLU SAEs to address this issue directly, setting the stage for further advances in SAE design.
The significant potential of JumpReLU SAEs, together with the still-nascent research on principled evaluation of SAE performance, suggests that the future of SAEs is promising.