Understanding SAE features using Sparse Feature Circuits


Lovis Heindrich

Complete · Grant

$11,000 raised

Project summary

I, Lovis Heindrich, am planning to research the use of sparse autoencoder (SAE) circuits to better understand SAE features. The project will be carried out during a research visit at the Torr Vision Group at the University of Oxford, mentored by Fazl Barez and Prof. Philip Torr, in collaboration with Veronika Thost (MIT-IBM Watson Lab). I am seeking $8000 to cover my salary for 3 months of full-time work on the project, plus $3000 for additional compute. If the compute budget is not fully used, the remainder will go toward conference fees.

What are this project's goals and how will you achieve them?

Recent work on SAEs [Anthropic 2024] has demonstrated that SAE feature discovery is feasible in larger models and has found safety-relevant features that are causally important for the models’ behavior. Understanding what causes these features to activate is an important open research question. Our project’s goal is to create circuit-based explanations of such SAE features. Current approaches [Anthropic 2024, OpenAI 2023] that generate feature explanations from activating dataset examples are limited because they can produce overly broad explanations or interpretability illusions [Bolukbasi et al. 2021]. We plan to make progress on this problem using circuit discovery methods [Syed et al. 2023, Marks et al. 2024, Dunefsky & Chlenski 2024]. We will explore ways in which circuit-based explanations can improve both our understanding of sparse autoencoder features and their usefulness.

How will this funding be used?

$8000 will cover my salary to work on the project full-time for 3 months. The remaining $3000 will be used for compute and/or conference fees.

Who is on your team and what's your track record on similar projects?

Lovis Heindrich: I’m a past MATS scholar. During the program I worked with Neel Nanda and published relevant work analyzing MLP circuits in Pythia-70M. I also gained experience training and evaluating sparse autoencoders during the MATS extension.

Fazl Barez, Veronika Thost, Philip Torr

What are the most likely causes and outcomes if this project fails? (premortem)

If this project were to fail, we’d expect the most likely cause to be that feature circuits are either too distributed or rely on uninterpretable features. Insufficiently good SAEs could also limit the project, especially if a large proportion of a feature circuit can’t be explained by earlier SAE features. We are optimistic that using recent sparse autoencoders, including recent improvements to the SAE architecture, will help us overcome these potential issues.

What other funding are you or your project getting?

The Torr Vision Group at the University of Oxford will provide access to a compute cluster and cover costs related to accommodation and travel to Oxford.


Lovis Heindrich

about 2 months ago

Final report

Description of subprojects and results, including major changes from the original proposal

I completed the research project with a paper on SAE feature generalization accepted at the ICML 2025 Workshop on Reliable and Responsible Foundation Models: https://arxiv.org/abs/2502.19964

The project evolved from the original plan of circuit discovery to a systematic evaluation of feature generalization. In the initial stages, I focused on SAE circuit discovery methods for safety-relevant features. I collected safety-related datasets, identified relevant SAE features, and implemented attribution patching methods for SAE features using the Gemma 2 models. I analyzed features across multiple domains: cybersecurity, sycophancy, bias, and answerability. Across all domains, I found that discovered circuits were highly distributed and that SAE features showed inconsistent generalization across different examples within the same domain. These observations led me to investigate whether the identified SAE features represent generalizable abstractions.
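As a rough illustration of the attribution patching idea mentioned above, the sketch below applies it to a toy model rather than to Gemma 2. Attribution patching estimates the effect of swapping each feature activation from a clean to a corrupted value with a first-order approximation, (corrupt - clean) times the gradient of the metric at the clean input, and the sketch compares that estimate against exact one-at-a-time patching. The decoder `W_dec`, the readout `v`, and the metric are all hypothetical stand-ins, not the code used in the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d_model = 6, 4
W_dec = rng.normal(size=(n_feat, d_model))  # toy SAE decoder (assumption)
v = rng.normal(size=d_model)                # toy readout defining the metric (assumption)

def metric(f):
    """Toy downstream metric computed from SAE feature activations f."""
    return float(np.tanh(f @ W_dec) @ v)

def metric_grad(f):
    """Analytic gradient of the toy metric with respect to f."""
    h = f @ W_dec
    return W_dec @ ((1.0 - np.tanh(h) ** 2) * v)

def attribution_patching(f_clean, f_corrupt):
    """First-order estimate of the effect of patching each feature separately."""
    return (f_corrupt - f_clean) * metric_grad(f_clean)

def activation_patching(f_clean, f_corrupt):
    """Exact effect: patch one feature at a time and re-evaluate the metric."""
    base = metric(f_clean)
    effects = np.zeros_like(f_clean)
    for i in range(len(f_clean)):
        f = f_clean.copy()
        f[i] = f_corrupt[i]
        effects[i] = metric(f) - base
    return effects

# Non-negative activations, as an SAE with a ReLU encoder would produce.
f_clean = np.maximum(rng.normal(size=n_feat), 0.0)
f_corrupt = np.maximum(rng.normal(size=n_feat), 0.0)

approx = attribution_patching(f_clean, f_corrupt)
exact = activation_patching(f_clean, f_corrupt)
for i in range(n_feat):
    print(f"feature {i}: attribution={approx[i]:+.4f}  exact={exact[i]:+.4f}")
```

The estimate needs only one gradient evaluation for all features, whereas exact patching needs one evaluation per feature. The two agree closely only when the change in activation is small relative to the curvature of the metric.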

For the generalization study, I focused on answerability as a test case for abstract concept representation. I developed an evaluation framework using three established and two novel datasets (mathematical equations and celebrity recognition) designed to test different aspects of generalization. I identified SAE features based on their in-domain performance on one dataset and evaluated their transfer to the remaining four out-of-distribution datasets, comparing against linear probes trained on residual stream activations. While I found well-performing SAE features on the in-domain dataset, their transfer to the out-of-distribution datasets was inconsistent. Some features generalized better than linear probes on specific dataset pairs, while others showed near-random transfer performance. No feature reliably generalized to all out-of-distribution data better than the residual probe.
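The evaluation protocol described above can be sketched roughly as follows: pick the single SAE feature that best separates the labels on one in-domain dataset, freeze its threshold, and score it on the other datasets alongside a linear probe trained on the same in-domain residual activations. Everything in the sketch is an assumption made for illustration: the data are synthetic, the encoder `W_enc` is random, the dataset names are placeholders, and the threshold-based feature classifier and gradient-descent probe may differ from what the paper uses.

```python
import numpy as np

def threshold_fit(acts, labels):
    """Return (accuracy, threshold, sign) of the best 1-D threshold classifier."""
    best = (0.0, 0.0, 1)
    for t in np.unique(acts):
        for sign in (1, -1):
            acc = float(((sign * acts >= sign * t).astype(int) == labels).mean())
            if acc > best[0]:
                best = (acc, float(t), sign)
    return best

def feature_accuracy(sae_acts, labels, j, t, sign):
    """Accuracy of feature j with a frozen threshold and polarity."""
    return float(((sign * sae_acts[:, j] >= sign * t).astype(int) == labels).mean())

def train_probe(x, y, steps=500, lr=0.1):
    """Logistic-regression probe on residual activations, plain gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y)) / len(y)
        b -= lr * float((p - y).mean())
    return w, b

def probe_accuracy(x, y, w, b):
    return float((((x @ w + b) > 0).astype(int) == y).mean())

# Synthetic stand-ins for residual-stream and SAE activations (assumptions).
rng = np.random.default_rng(0)
d_model, n_feat, n = 8, 16, 400
concept = np.eye(d_model)[0]
W_enc = rng.normal(size=(d_model, n_feat))
W_enc[:, 0] = concept  # one feature aligned with the in-domain concept

def make_dataset(direction):
    """Labels are encoded along `direction`; SAE activations are ReLU(x @ W_enc)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d_model)) + 2.0 * np.outer(2 * y - 1, direction)
    return x, np.maximum(x @ W_enc, 0.0), y

datasets = {
    "in_domain": make_dataset(concept),
    "ood_a": make_dataset(0.5 * concept + 0.5 * np.eye(d_model)[1]),
    "ood_b": make_dataset(np.eye(d_model)[2]),
}

# Select the feature and train the probe on the in-domain dataset only.
x_in, f_in, y_in = datasets["in_domain"]
fits = [threshold_fit(f_in[:, j], y_in) for j in range(n_feat)]
j_best = int(np.argmax([fit[0] for fit in fits]))
_, t_best, sign_best = fits[j_best]
w, b = train_probe(x_in, y_in)

# Score both on every dataset without refitting.
results = {
    name: {
        "sae_feature": feature_accuracy(f, y, j_best, t_best, sign_best),
        "probe": probe_accuracy(x, y, w, b),
    }
    for name, (x, f, y) in datasets.items()
}
for name, row in results.items():
    print(name, row)
```

Comparing the `sae_feature` and `probe` columns per dataset mirrors the pairwise comparison in the report. The synthetic numbers say nothing about real models; the point is only that selection happens in-domain and nothing is refit out of distribution.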

The paper demonstrates that while SAE features accurately represent concepts on in-domain data, they often fail to represent these concepts in an abstract and generalizable way that consistently transfers to out-of-domain data. The results suggest particular challenges when investigating abstract, high-level concepts compared to more typically studied syntactic features, revealing the need for out-of-distribution evaluation in SAE research.

I'm grateful to Fazl Barez, Veronika Thost, and Philip Torr for their supervision and mentorship. Thanks to Neel Nanda, Manifund, and the Torr Vision Group for facilitating and funding this project.

Spending breakdown

Salary: $11000 (original $8000 + reallocation of $3000 compute and travel budget)

The project was extended significantly beyond the initial scope. Beyond the 3 months of full-time work from June through August, I spent a further 5 months working part-time to complete the project and write the paper.

Travel costs and compute were provided by the Torr Vision Group.

Austin Chen

over 1 year ago

Approving this as in line with our mission of advancing AI safety research. Thanks to Lovis and Neel for their public writeups on this!

donated $11,000

Neel Nanda

over 1 year ago

I had previously discussed this grant with Lovis and suggested he apply.

Why is this a good idea?

I think Sparse Autoencoders are one of the most promising areas of mech interp work right now. Better understanding SAE circuits seems exciting, and I think that understanding the circuit required to produce a feature is an important direction. This is both a sub-part of the broader project of finding end-to-end circuits, and could help with interpreting what a feature does (especially important features like the safety relevant features in Scaling Monosemanticity) - I would be very excited if this project finds case studies of features that have ambiguous maximum activating examples, but the meaning is clarified by studying a circuit.

(Note that the applicants shared a more detailed project proposal with me than what was shared publicly, which I broadly think was sensible, though I disagreed on some points.)

Concerns

Research is hard, and there's a good chance this project doesn't really go anywhere interesting.
This is a hard and somewhat open-ended question, though I think they had some decent ideas of concrete entry points.
There are many directions the project could go in, and it'd be easy to get caught in rabbit holes or constantly flit between things and never do any of them properly.

Why this amount?

This was the salary requested. I think it was somewhat pegged to academic summer researcher salaries, which are a fair bit lower than the market rate for independent researchers, so no complaints from me. The compute may not be needed, since the lab provides some, but it would be silly for the project to be bottlenecked by lacking compute. This overall seems like a fairly small grant, with some chance of going somewhere interesting, and so a pretty obvious accept.

Conflicts of interest

Lovis is one of my MATS alumni, but we haven't been working together for several months, so I don't feel too concerned about the conflict of interest, and it means I have a fair amount of data to evaluate him. I don't personally benefit from this project (except in that all good mech interp research helps my own work!), and I don't anticipate being a co-author on any papers produced.