Thank you for supporting this project! Sparse Feature Circuits won an ICLR 2025 spotlight, and informed work across the field. https://features.baulab.info/
Recent work has developed methods to automatically discover "circuits" implementing basic functionality in neural networks. These circuits are described in terms of computation involving neurons. However, concurrent work has argued that neurons are often in "superposition," representing multiple features at the same time, and that sparse autoencoders can extract these features from neurons. This raises the question: would circuits be more discoverable and easier to understand if represented at the feature level rather than the neuron level? This project intends to explore automatic circuit discovery over features.
At a high level, the project will seek to determine whether circuits over features are easier to understand and more readily discoverable by automatic means than circuits over neurons (the baseline method). Concretely, preliminary results have found circuits for subject-verb agreement. Can intends to extend this to a simple ELK-style problem: identifying a concept represented by the model from a labeled dataset of positive and negative examples of the concept, while disambiguating it from a highly correlated concept the model also represents, based on understanding how the respective concepts are computed.
Research stipend to Can Rager. Can may pay for other expenses such as office space and compute out of this stipend.
Can Rager (Google Scholar) has experience working on automatic circuit discovery from the "Attribution Patching Outperforms Automated Circuit Discovery" preprint.
He will collaborate with Sam Marks, a post-doc in the Bau Lab who specializes in interpretability, and Prof. David Bau.
The research idea is high-risk, high-reward: circuits on features may prove to be difficult to find and understand.
Additionally, the entire field of mechanistic interpretability is taking a high-risk bet that neural networks can be reverse engineered at scale. Although automatic methods such as the one this project investigates help with scalability, many challenges remain. Even if the project succeeds by its own lights, it may turn out to be a research dead end if future work cannot develop the method further.
There is an execution risk, as Can is relatively new to research. This is mitigated by collaboration with Sam and David; however, David may be time-constrained (as a professor with many PhD students), and Sam's PhD was in math, so he may have limited hands-on engineering experience.
Can Rager
1 day ago
Thank you for supporting this project! Sparse Feature Circuits won an ICLR 2025 spotlight, and informed work across the field. https://features.baulab.info/
Adam Gleave
over 2 years ago
Promising research idea; "obvious next step" but not one that anyone else seems to be working on.
Can Rager has relevant research experience.
David Bau's lab is a recognized name in the field and a competent collaborator.
Limited track record from Can.
Research project is high-risk, high-reward.
$6000-$9000/month seems to be around the going rate for junior independent researchers based on previous LTFF grants. I went on the higher end because: (a) the stipend may need to pay for office expenses, not just living expenses; (b) Can intends to be based in the Bay Area for some of this time, a high cost-of-living location.
Can may spend some of his stipend on a desk & membership in FAR Labs, an AI safety co-working space administered by the non-profit FAR AI that I am the founder and CEO of. This is not a condition of this grant, and I have encouraged Can to explore other office options as well. I do not directly benefit financially from additional members at FAR Labs, nor would one member materially change FAR AI's financial position. No other conflicts of interest.
Neel Nanda
about 2 years ago
@AdamGleave Just noting that I was quite impressed by the paper that came out of this ( https://arxiv.org/abs/2403.19647 ) - good grant, and good work by Sam, Can and co!