Description of subprojects and results, including major changes from the original proposal
I completed the research project with a paper on SAE feature generalization accepted at the ICML 2025 Workshop on Reliable and Responsible Foundation Models: https://arxiv.org/abs/2502.19964
The project evolved from the original plan of circuit discovery to a systematic evaluation of feature generalization. In the initial stages, I focused on SAE circuit discovery methods for safety-relevant features. I collected safety-related datasets, identified relevant SAE features, and implemented attribution patching methods for SAE features using the Gemma 2 models. I analyzed features across multiple domains: cybersecurity, sycophancy, bias, and answerability. Across all domains, I found that discovered circuits were highly distributed and that SAE features showed inconsistent generalization across different examples within the same domain. These observations led me to investigate whether the identified SAE features represent generalizable abstractions.
For the generalization study, I focused on answerability as a test case for abstract concept representation. I developed an evaluation framework using three established and two novel datasets (mathematical equations and celebrity recognition) designed to test different aspects of generalization. I identified SAE features based on their in-domain performance on one dataset and evaluated their transfer to the remaining four out-of-distribution datasets, comparing against linear probes trained on residual stream activations. Although I found SAE features that performed well on the in-domain dataset, their transfer to the out-of-distribution datasets was inconsistent. Some features generalized better than linear probes on specific dataset pairs, while others showed near-random transfer performance. No feature reliably outperformed the residual stream probe across all out-of-distribution datasets.
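The select-then-transfer protocol described above can be illustrated with a minimal sketch. This is not the code used in the paper: the function names, the fixed activation threshold of 0, the use of accuracy as the selection metric, and the toy activations are all assumptions made for illustration, and the linear-probe baseline is omitted for brevity.

```python
from typing import Dict, List, Sequence, Tuple

Dataset = Tuple[List[List[float]], List[int]]  # (activations[example][feature], labels)


def feature_accuracy(acts: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.0) -> float:
    """Accuracy when predicting label 1 whenever the activation exceeds threshold.

    The threshold of 0 is an assumption of this sketch.
    """
    preds = [1 if a > threshold else 0 for a in acts]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def column(acts: List[List[float]], j: int) -> List[float]:
    """Activations of feature j across all examples."""
    return [row[j] for row in acts]


def select_feature(acts: List[List[float]], labels: Sequence[int]) -> int:
    """Index of the feature with the highest in-domain accuracy (first one on ties)."""
    n_features = len(acts[0])
    scores = [feature_accuracy(column(acts, j), labels) for j in range(n_features)]
    return max(range(n_features), key=scores.__getitem__)


def transfer_report(in_domain: Dataset,
                    ood_sets: Dict[str, Dataset]) -> Tuple[int, Dict[str, float]]:
    """Pick a feature on the in-domain data only, then score it on every dataset."""
    acts, labels = in_domain
    best = select_feature(acts, labels)
    report = {"in_domain": feature_accuracy(column(acts, best), labels)}
    for name, (ood_acts, ood_labels) in ood_sets.items():
        report[name] = feature_accuracy(column(ood_acts, best), ood_labels)
    return best, report


# Hypothetical toy data: two features, four examples per dataset.
in_domain: Dataset = ([[1.0, 0.0], [2.0, 0.5], [0.0, 0.0], [0.0, 0.3]], [1, 1, 0, 0])
toy_ood: Dataset = ([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]], [1, 1, 0, 0])

best, report = transfer_report(in_domain, {"toy_ood": toy_ood})
print(best, report)
```

In this toy setup the selected feature separates the in-domain examples perfectly but scores at chance on the held-out set, which is the kind of gap the evaluation is designed to expose. A fuller version would add a probe fitted on residual stream activations and score it on the same splits.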
The paper demonstrates that while SAE features accurately represent concepts on in-domain data, they often fail to represent these concepts in an abstract and generalizable way that consistently transfers to out-of-domain data. The results suggest that abstract, high-level concepts pose particular challenges compared to the more commonly studied syntactic features, and they underscore the need for out-of-distribution evaluation in SAE research.
I'm grateful to Fazl Barez, Veronika Thost, and Philip Torr for their supervision and mentorship. Thanks to Neel Nanda, Manifund, and the Torr Vision Group for facilitating and funding this project.
Spending breakdown
Salary: $11,000 (original $8,000 + reallocation of the $3,000 compute and travel budget)
The project was extended significantly beyond the initial scope. In addition to 3 months of full-time work from June through August, I spent a further 5 months working part-time to complete the project and write the paper.
Travel costs and compute were provided by the Torr Vision Group.