This project aims to produce a mechanistic interpretability paper that builds on previous research on sparse autoencoders (SAEs). At this point, we are quite good at finding and labelling features. However, very little research (Sam Marks et al.) has been done on how these features interact.
The goal of this project is to publish a paper about a novel way of identifying SAE-feature circuits. The paper should investigate the following points:
Implementing the gradient-based circuit-discovery algorithm.
Modifying edges in the circuit and observing the resulting changes in the model's behaviour.
(Optional) Using connections to well-labelled features to improve the labels of hard-to-interpret features.
This funding will mainly be used to cover my living costs during the research and paper-writing period. There are also some computational costs to be covered, but these are only a minor part.
I'm working with Jason Liu from Cambridge.
The most likely cause of failure is that the computational cost of running my circuit-finding algorithm is too high at the required scale. Initial tests with my unoptimized prototype setup indicate that this is nonetheless unlikely.
Another risk is that no interesting results emerge from the experiments. For technical reasons, I also view this as unlikely.
I have raised $54,000 in the last 12 months ($11k from Open Philanthropy and $43k from the Hasler Foundation). With this, I produced two mech-interp papers, both of which were accepted (one of them at the ICML mech-interp workshop).