The full funding application is available here, including an explanation of the threat model, a developed plan, an extensive theory of change, and a risk assessment. Please treat the below as an executive summary.
Items marked (Goal) are activities covered by the funding goal, as opposed to the funding minimum.
Problem: It is unclear how well interpretability techniques that assess model internals by reducing them to causal mechanisms will generalize as capabilities advance and new architectures and paradigms are introduced.
I want to develop arguments suggesting that the generalizability of these techniques is vulnerable to the following adaptations:
Scaffolding changes
Substrate-level (i.e., model, architecture, paradigm) changes
AI-assisted development of the above
Self-modification
I also want to extend this roadmap to consider risk factors that these vulnerabilities will amplify:
Deep deceptiveness, in which deception is hidden via low-level reconfigurations that evade monitoring/interpretability techniques whilst leaving overall behavior intact.
Aggregate/Diffused deceptiveness, in which deceptive circuits are distributed across multiple connected AI systems, increasing the space over which interpretability techniques must search.
These arguments are mostly conceptual at the moment and are presented in a forthcoming arXiv publication (co-authors: Matt Farr, Chris Pang, Sahil K; draft available here). They have generated interest, along with both positive and critical feedback, from researchers. I seek funding to continue developing the work’s realism and technical rigor by (i) iterating on the paper using the feedback and suggestions received and (ii) upskilling in the relevant technical areas to assess the work’s core premises more fully.
This work will feed into the wider MoSSAIC project, which others within the High Actuation Spaces group are working on. I believe my work here will motivate and structure the development of an alternative paradigm for interpretability that tackles the higher-level structures we care about in AI safety.
Clarify and Articulate Reductionist/Mechanistic Paradigm
Examine the ways in which AI safety (especially interpretability) privileges a bottom-up approach from causal mechanisms to goal-oriented behavior.
More specifically, investigate two key assertions that we claim are crucial for the continued success of the paradigm:
Ontological: That structural properties discovered in AI systems remain stable as capabilities increase
Epistemological: That structural properties can reliably indicate safety-relevant behaviors
Approach
Theoretical research/upskilling
Mech-interp and (Goal) dev-interp/causal incentives
Parallel debates in neurology/philosophy of mind/science
Refinement of terminology/concepts of substrate, mechanism, etc., through feedback and further examples.
Develop "Substrate-Flexible Risk" Framework
Establish/present a new threat model for AI safety that captures:
Changes in AI architectures and paradigms
Self-modifying AI systems
Deep deceptiveness and its distributed equivalent (termed aggregate risk in the paper).
Use the threat model framing to express a number of evasive risk scenarios, including Deep Deceptiveness, Sharp Left Turn, and Robust Agent-Agnostic Processes.
(Goal) Assess the importance of substrate-flexible risks in governance/regulation practices.
(Goal) Motivate and start developing solutions to the threat model (see Section 5 onwards)
Approach
Incorporation of feedback on current draft of MoSSAIC publication
Examination of case studies of architecture/paradigm changes (Mamba, Kolmogorov-Arnold networks, transformers).
(Goal) Analysis of Anthropic/AISI safety case policy directions. Consultation with governance experts.
Minimum funding will cover my living costs and minor research expenses for 3 months at 0.6 FTE.
Goal funding is 4 months FTE with access to LISA.
I am collaborating with Chris Pang, under the supervision of Sahil K (independent, ex-MIRI). We are both members of the High Actuation Spaces community and have access to a wide range of researchers who can provide feedback as we develop the project.
I also have Matthew Wearden as my research manager, who will serve as another set of eyes on the process.
I co-authored the above-mentioned paper and presented earlier sketches of these ideas at LISA and MATS. I have completed similar long-term projects as part of my university degree.
The literature may be insufficient, or the identification of a reductionist paradigm may be impossible for more theoretical reasons.
I may also simply run out of time. I will assess progress at the halfway point, in consultation with my supervisor and RM.
I intend to commit some time to assessing any negative results, and I suspect this work will also feed into work being conducted by others on the foundations and limitations of mech-interp.
The full MoSSAIC project was added to MATS 6.0 (June–August 2024) by Ryan Kidd, and so received the $9,000 stipend from AI Safety Support.