What progress have you made since your last update?
(Un)fortunately, a lot of the research areas I was interested in exploring have become substantially more mainstream since I wrote the research proposal. For example, Stephen Casper and collaborators have put out their latent adversarial training paper, FAR has completed their work on adversarial example/training scaling laws for transformers, and many at Anthropic and other labs are investing significant amounts of time and resources into adversarial training and related areas.
Instead, I've worked on laying theoretical groundwork for ideas in mechanistic interpretability. While there's been a lot of incremental empirical work on improving SAEs or applying SAEs to larger models, there are many theoretical questions in interp that are much more neglected. Specifically, over the course of the grant, I've worked on and completed the following two projects:
Compact Proofs and Mechanistic Interp: There's been a small but steady amount of ongoing discussion on methods to evaluate circuits or mechanistic explanations in general. On one end, we have sampling-based methods like causal scrubbing, and on the other end, we have proofs. So it's natural to ask: can we use proof length/quality to evaluate the degree of mechanistic understanding? Can we even write non-vacuous proofs about model behavior at all? We've completed preliminary work showing that the answer to both questions is yes on small max-of-K transformers: blog post, paper.
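To make the contrast between sampling and proofs concrete, here is a minimal sketch, not taken from the paper. It assumes a hypothetical hand-written `toy_model` standing in for a trained max-of-K transformer, with made-up sizes `VOCAB` and `K`. Checking every input amounts to a brute-force proof of the model's exact accuracy, while sampling only gives an estimate. The brute-force check is the least compact proof possible. The interesting question is how much shorter a proof can get once it uses mechanistic understanding, which this sketch does not attempt.

```python
from itertools import product
import random

VOCAB = 4  # hypothetical vocabulary size (assumption for illustration)
K = 3      # hypothetical sequence length (assumption for illustration)


def toy_model(tokens):
    """Hypothetical stand-in for a trained max-of-K model.

    It ignores the first position, so it is wrong whenever the first
    token is strictly larger than all the others.
    """
    return max(tokens[1:])


def brute_force_accuracy():
    """Exact accuracy, certified by checking every possible input."""
    inputs = list(product(range(VOCAB), repeat=K))
    correct = sum(toy_model(t) == max(t) for t in inputs)
    return correct / len(inputs)


def sampled_accuracy(n_samples=200, seed=0):
    """Accuracy estimated from random inputs; no guarantee attached."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        t = tuple(rng.randrange(VOCAB) for _ in range(K))
        correct += toy_model(t) == max(t)
    return correct / n_samples


if __name__ == "__main__":
    print("certified accuracy:", brute_force_accuracy())
    print("sampled estimate:  ", sampled_accuracy())
```

The brute-force check costs VOCAB**K model evaluations, which is why it cannot scale and why shorter proofs are the quantity of interest.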
Models of Computation in Superposition: There's an implicit model of computation in superposition that people in mech interp seem to rely on, where models use superposition to approximately represent a sparse boolean circuit. In contrast, the standard toy models of superposition are Anthropic's TMS, which focuses on representational superposition, and Scherlis et al.'s model involving quadratic forms, both of which only consider superposition occurring at a single layer. Together with some collaborators, I've built out a model of superposition that is closer to the implicit model, where superposition allows for more compact computation of larger circuits and can be maintained across many layers (paper).
What are your next steps?
I'm deciding between returning to METR, my PhD, or other job opportunities.
If I do pursue a job that allows me to do related research, I'll probably follow up on the projects as follows:
Proofs and Mech Interp: I don't believe that formal proofs will scale to frontier models, but working on the project has made me more convinced of the feasibility of building formal-ish systems for doing mech interp. A natural follow-up would be ARC Theory's heuristic arguments work (as laid out in Jacob's recent blog post), which neatly gets around one of the main issues with scaling proofs. I'd probably do empirical work applying heuristic arguments to transformer models.
Models of Superposition: Over the course of this work, I've become convinced that the model of networks as computing sparse boolean circuits is incorrect. Instead, it seems like every interesting sort of computation requires inhibition or competition between features, and sparse boolean circuits do not allow for inhibition. I think building a model of how inhibition works for features in superposition is the natural next step. (Relatedly, see this post by Redwood on how inhibition allows for substantially more representational power in a one-layer attention transformer than "just" skip-trigrams.)
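As a toy illustration of why inhibition matters, here is a sketch under one loud assumption: "no inhibition" is read as "monotone", meaning that turning an input on can never turn an output off (AND and OR gates only, no negation). This reading is mine and is not necessarily the exact formalization used in the paper. Under it, XOR cannot be computed without inhibition, but two ReLU units with one negative (inhibitory) weight suffice. All function names below are hypothetical.

```python
from itertools import product


def relu(x):
    return max(0, x)


def is_monotone(f, n_inputs=2):
    """True if flipping any input from 0 to 1 never lowers the output."""
    points = list(product((0, 1), repeat=n_inputs))
    for x in points:
        for y in points:
            # x <= y componentwise means y only switches inputs on.
            if all(a <= b for a, b in zip(x, y)) and f(*x) > f(*y):
                return False
    return True


def and_gate(a, b):
    return a & b


def or_gate(a, b):
    return a | b


def xor_gate(a, b):
    return a ^ b


def xor_with_inhibition(a, b):
    """Two ReLU units; the second inhibits the output with weight -2."""
    return relu(a + b) - 2 * relu(a + b - 1)


if __name__ == "__main__":
    for name, gate in [("AND", and_gate), ("OR", or_gate), ("XOR", xor_gate)]:
        print(name, "monotone:", is_monotone(gate))
    for a, b in product((0, 1), repeat=2):
        print(a, b, "->", xor_with_inhibition(a, b))
```

Since compositions of monotone gates are themselves monotone, no circuit built only from AND and OR can match the XOR truth table, whereas the single negative weight above does.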
Is there anything others could help you with?
I'm pretty confused about what to do career-wise -- any advice would be appreciated.