Project summary
Sparse autoencoders (SAEs) are effective at extracting distinct, largely monosemantic features from transformer language models. An effective dictionary of feature decompositions for a transformer component roughly sketches out the set of features that component has learned.
The goal of this project is to provide a more intuitive and structured way of understanding how features interact to form complex, high-level behaviors in transformer language models. To that end, we want to define and explore the feature manifold, understand how features evolve across different transformer components, and use these feature decompositions to find interesting circuits.
What are this project's goals and how will you achieve them?
SAE Feature Decomposition for Circuit Search Algorithm Development
A major challenge in using sparse autoencoders for future interpretability work is turning feature decompositions into effective predictive circuits. Existing algorithms, such as ACDC, are based on pruning computation graphs and do not scale easily to larger models.
Cunningham et al. demonstrate that causal links can be identified by modifying features from an earlier layer and observing the impact on subsequent layer features.
A promising approach for a circuit search algorithm would be to observe changes in feature activations upon ablating features in a previous layer. We could focus on a subset of the input distribution to simplify the analysis and find more interpretable features.
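As a toy sketch of this idea, one could subtract a single early-layer feature's decoder contribution from the residual stream and measure the change in later-layer feature activations. All matrices below are random stand-ins for real model and SAE weights, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_feat = 16, 32

# Hypothetical stand-ins: SAE decoder directions at layer L, a toy SAE
# encoder at layer L+1, and a linear map standing in for the transformer
# computation between the two layers.
W_dec_early = rng.normal(size=(n_feat, d_model))
W_enc_late = rng.normal(size=(d_model, n_feat))
layer_map = rng.normal(size=(d_model, d_model))

def late_features(resid):
    """Encode a residual-stream vector with the later layer's (toy) SAE encoder."""
    return np.maximum(W_enc_late.T @ (layer_map @ resid), 0.0)  # ReLU activations

def ablation_effect(resid, early_acts, feat_idx):
    """Change in later-layer feature activations when one early feature is ablated."""
    ablated = resid - early_acts[feat_idx] * W_dec_early[feat_idx]
    return late_features(ablated) - late_features(resid)

resid = rng.normal(size=d_model)
early_acts = np.maximum(W_dec_early @ resid, 0.0)  # toy early-layer activations
delta = ablation_effect(resid, early_acts, feat_idx=3)
# Later-layer features with large |delta| are candidate downstream circuit nodes.
```

In a real experiment, `resid` and the activations would come from hooked forward passes of the model, and the effect would be averaged over a narrow input distribution.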
For modeling end-to-end behaviors, we would use an unsupervised learning algorithm to learn clusters of features and identify similar ones (i.e., learn feature manifolds). We would then use a similarity metric (such as cosine similarity) to group features and run ACDC over the resulting feature decompositions.
Further, we would investigate how feature splitting occurs in the context of these manifolds. Are features divided into smaller manifolds, or do they split within a manifold?
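A minimal sketch of the grouping step, using cosine similarity over SAE decoder rows. The greedy single-pass clustering and the 0.9 cutoff are illustrative assumptions, not a committed design:

```python
import numpy as np

def cluster_features(W_dec, sim_threshold=0.9):
    """Greedily group SAE decoder rows whose cosine similarity exceeds a threshold."""
    unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    sims = unit @ unit.T
    labels = -np.ones(len(W_dec), dtype=int)  # -1 marks an unassigned feature
    next_label = 0
    for i in range(len(W_dec)):
        if labels[i] >= 0:
            continue
        labels[i] = next_label
        # Absorb every still-unlabeled feature close to feature i.
        for j in range(i + 1, len(W_dec)):
            if labels[j] < 0 and sims[i, j] >= sim_threshold:
                labels[j] = next_label
        next_label += 1
    return labels

# Two copies of one direction plus an orthogonal one yield two clusters.
W = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(cluster_features(W))  # [0 0 1]
```

In practice one would likely use a proper hierarchical or density-based clustering algorithm over the decoder directions; the point here is only that candidate "feature manifolds" fall out of similarity structure in the dictionary.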
Optimizing feature extraction
Connecting dictionaries learned on different transformer components
How do the learned features evolve over different components of the model?
For example, how do the features learned by the MLPs, with their richer dimensionality, relate to those learned by the attention heads and the residual stream?
What metrics can we develop to better understand and measure these relationships?
What mappings are suitable for connecting dictionaries? Can we use gradient descent to find the connections between two SAEs learned on different components of the transformer model?
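As a first-cut sketch of the gradient-descent idea, one could fit a linear connection map from one component's feature activations to another's. The synthetic activations below stand in for real paired SAE outputs; the linear form of the map is itself an assumption to be tested:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_early, n_late = 200, 8, 6

# Hypothetical paired feature activations from SAEs on two components,
# generated from a known ground-truth map plus noise for this sketch.
A_early = rng.normal(size=(n_samples, n_early))
true_map = rng.normal(size=(n_early, n_late))
A_late = A_early @ true_map + 0.01 * rng.normal(size=(n_samples, n_late))

# Learn a linear connection M by plain gradient descent on squared error.
M = np.zeros((n_early, n_late))
lr = 0.01
for _ in range(2000):
    resid = A_early @ M - A_late
    grad = A_early.T @ resid / n_samples
    M -= lr * grad

# Large entries of M suggest which early features feed which late features.
```

With real data, sparsity penalties on M would be worth exploring, since we expect each late feature to be built from only a few early features.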
What specific properties make a feature more suitable for inclusion in the residual stream?
If a feature is not added to the residual stream at a certain layer (L), can it emerge (if so, under what conditions?) in a subsequent layer (L+k)?
Can we predict whether and when a model will exhibit end-to-end behaviors by tracking the addition of constituent features to the residual stream at various stages of training?
Efficiency of feature learning in SAEs
If an SAE is trained on a dataset D’ that is a subset of a dataset D, do the features it learns form a clear subset of the features learned by the same SAE trained on the entire dataset D?
Does the efficiency of feature learning diminish?
Can we train smaller SAEs to find subsets of features?
Analysis of feature quality: are features learned on D’ more or less noisy than features learned on D?
Do features learned on the subset D’ generalize well to the full dataset D?
Can we develop better metrics for comparing feature sets?
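One candidate metric, offered as an illustrative assumption rather than an established measure, is mean max cosine similarity: for each feature in the smaller dictionary, how close is its nearest neighbor in the larger one?

```python
import numpy as np

def dictionary_coverage(D_sub, D_full):
    """Mean over rows of D_sub of the max cosine similarity to any row of D_full.

    Values near 1 mean every D_sub feature has a close counterpart in D_full;
    values near 0 mean the dictionaries share little directional structure.
    """
    a = D_sub / np.linalg.norm(D_sub, axis=1, keepdims=True)
    b = D_full / np.linalg.norm(D_full, axis=1, keepdims=True)
    return float(np.mean(np.max(a @ b.T, axis=1)))

# A dictionary compared with itself is perfectly covered.
print(dictionary_coverage(np.eye(4), np.eye(4)))  # 1.0
```

The asymmetry is deliberate: coverage of D’-features by D-features tests the "clear subset" hypothesis above, while the reverse direction measures what the subset-trained SAE misses.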
How will this funding be used?
The main use of the grant will be to acquire the compute resources needed to run the experiments. I would train multiple sparse autoencoders (SAEs) on A100 GPUs for a range of open-source models, including both fine-tuned and base models for comparison. I would also experiment with different SAE architectures, feature evolution, etc. The remaining funds would provide a modest salary for me, as I am currently funding this work out of pocket.
What are the most likely causes and outcomes if this project fails? (premortem)
It’s possible that
the circuit search algorithm using feature decompositions does not outperform traditional methods, or does not scale well.
Navigating the feature manifold requires substantial computational resources and proves difficult.
The project attracts the attention of capabilities labs.
What other funding are you or your project getting?
None; I am currently funding this project out of pocket. That is the primary reason I am applying for this grant: I want to expand the scope of the project, which requires additional computational resources.
I want to pursue alignment work (focused on interpretability), full-time, and will be applying for funding for that.