Research on SAE-Feature-Circuits

Project summary

This project aims to produce a mechanistic interpretability paper that builds on previous research on sparse autoencoders (SAEs). The field is now quite good at finding and labelling SAE features. However, very little research (Sam Marks et al.) has been done on how these features interact.

What are this project's goals? How will you achieve them?

The goal of this project is to publish a paper about a novel way of identifying SAE-feature circuits. The paper should investigate the following points:

Implementing the gradient-based circuit-discovery algorithm.
Modifying edges in the circuit and observing the resulting changes in the model's behaviour.
(Optional) Using connections to well-labelled features to improve the labels of hard-to-interpret features.

How will this funding be used?

This funding will mainly be used to cover my living costs during the research and paper-writing period. Some computational costs also need to be covered, but these are only a minor part of the budget.

Who is on your team? What's your track record on similar projects?

I'm working with Jason Liu from Cambridge.

What are the most likely causes and outcomes if this project fails?

The most likely cause of failure is that the computational cost of running my circuit-finding algorithm is too high at the required scale. Initial tests with my unoptimized prototype setup indicate that this is unlikely.
Another risk is that no interesting results emerge from the experiments. For technical reasons, I also view this as unlikely.

How much money have you raised in the last 12 months, and from where?

I have raised $54,000 in the last 12 months ($11k from Open Philanthropy and $43k from the Hasler Foundation). With this funding, I produced two mech-interp papers, both of which were accepted (one of them at the ICML mech-interp workshop).