Develop techniques to impose just enough structure on LLM latent spaces during training to enable precise monitoring and post-hoc intervention.
Models naturally develop structured latent representations (e.g. the residual stream in transformers), but we currently have little control over how concepts organize. Prior attempts have focused on comprehensive supervision or post-hoc discovery, rather than minimal anchoring during training. My hypothesis is that if we give the models a gentle nudge right from the start of pre-training, they could become much more interpretable for the specific concepts and behaviors that we care about.
My preliminary experiments with autoencoders show that anchoring just one concept causes related concepts to organize predictably nearby, i.e. anchored concepts act as attractors during training. This happens even with only weak supervision, which suggests that it could be made to work for models trained on very large corpora that are hard to label (e.g. frontier models). Knowing where key concepts are located could enable surgical removal of dangerous capabilities without broad performance degradation, and could be highly resistant to capability recovery.
I see a path forward to apply this to LLMs, and I would like to pursue it.
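To make the idea concrete, here is a minimal sketch (in PyTorch) of the kind of anchoring loss used in the autoencoder experiments. The architecture, the fixed anchor coordinates, and the `anchor_weight` hyperparameter are illustrative placeholders chosen for this sketch, not the actual implementation from the Milestone 1 experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, d_in: int = 784, d_latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, d_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Fixed location in latent space chosen for the anchored concept (illustrative).
ANCHOR = torch.zeros(16)
ANCHOR[0] = 3.0

def anchored_loss(model, x, is_concept, anchor_weight=0.1):
    """Reconstruction loss plus a pull toward ANCHOR for weakly labelled examples.

    `is_concept` is a boolean mask; only the few labelled examples are regularized,
    which is what keeps the supervision weak and sparse.
    """
    x_hat, z = model(x)
    loss = F.mse_loss(x_hat, x)
    if is_concept.any():
        loss = loss + anchor_weight * F.mse_loss(z[is_concept], ANCHOR.expand_as(z[is_concept]))
    return loss
```

Related concepts receive no supervision at all; the observation from the preliminary experiments is that they nevertheless settle near the anchor.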
Delivery goal: Develop techniques that provide alignment researchers with:
Known concept locations before deployment (no search required)
Surgical intervention capabilities (remove specific harms without general performance loss)
Training-time safety measures rather than purely post-deployment fixes, including the potential to monitor capabilities of concern as they develop during training.
Personal goal: Complete my transition from software consultancy to AI safety research.
Proof-of-concept structuring of latent spaces in bottleneck autoencoders (COMPLETED)
Practical intervention (4-5 weeks): Demonstrate concept suppression and precise unlearning using the autoencoder from the initial experiments; see the sketch after this list (Minimal funding scenario)
Transformer transfer (8-10 weeks): Apply techniques to attention-based architectures using small transformers
Language model application (6-9 weeks): Structure abstract concepts (e.g. deception, harmfulness) in transformer language models, including development of the required training data pipeline
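As a rough illustration of the intervention targeted in Milestone 2 (referenced above), suppression could be as simple as projecting the anchored concept's direction out of the latent code at inference time. The function below is a sketch under that assumption, not the final method, and `ANCHOR` is the hypothetical anchor location from the earlier sketch.

```python
import torch

def suppress_concept(z: torch.Tensor, anchor_direction: torch.Tensor) -> torch.Tensor:
    """Remove each latent vector's component along the (known) anchored direction."""
    d = anchor_direction / anchor_direction.norm()
    return z - (z @ d).unsqueeze(-1) * d

# Because the concept was anchored during training, no search for the direction is
# needed: encode, ablate, then decode.
# z = model.encoder(x)
# x_hat = model.decoder(suppress_concept(z, ANCHOR))
```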
For more details, see Concept-anchored representation engineering for alignment, which is the introductory post in my LessWrong sequence on this research.
Clean suppression of target concepts without degrading unrelated capabilities, measured through controlled evaluation tasks (a sketch of such an evaluation appears below).
One or more LessWrong posts per milestone
All experiments (notebooks) published on GitHub
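The suppression criterion above could be checked with a simple paired evaluation. The sketch below assumes the autoencoder setting and reuses the hypothetical `suppress_concept` and `ANCHOR` from the earlier sketches; it is illustrative rather than the actual evaluation protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reconstruction_error(model, x, intervene=False):
    z = model.encoder(x)
    if intervene:
        z = suppress_concept(z, ANCHOR)  # from the earlier sketch
    return F.mse_loss(model.decoder(z), x).item()

# Suppression is "clean" if error rises sharply on concept-related inputs (x_target)
# while staying roughly flat on unrelated control inputs (x_control).
# err_target  = reconstruction_error(model, x_target,  intervene=True)
# err_control = reconstruction_error(model, x_control, intervene=True)
```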
Technical AI alignment research is increasingly focused on understanding and controlling the inner workings of large language models. This project contributes by exploring novel methods for shaping latent representations during training, complementing existing approaches that focus on post-hoc analysis or extensive supervision. If successful, this work could make other alignment efforts considerably easier. For example, I hope to demonstrate clean removal of concepts, which could improve the safety of even open-weights models.
Minimal scenario (5 weeks, Milestone 2 only): 77% salary, 14% overheads, 9% buffer.
Fully funded (24 weeks, Milestones 2-4): 85% salary, 6% overheads, 9% buffer.
The salary is based on what I could get as a full-time Senior AI Engineer today (including income tax and superannuation). This is effectively a discounted rate, because I haven’t adjusted for the loss of other Australian workplace entitlements such as leave (usually contractors charge more because of this).
Full details in this spreadsheet.
I will conduct this research independently, but I am part of a local community of practice (Melbourne AI Safety Hub). We frequently meet in person to discuss our work, including this research. Some other members of the Hub are also working on DevInterp, so I have people to validate my ideas with.
Published two technical posts on LessWrong:
Selective regularization for alignment-focused representation engineering (Milestone 1). Demonstrates a novel adaptation of regularization techniques for alignment applications, reporting both positive results (predictable latent structure) and negative results (curriculum learning approaches that didn't work as expected; publishing these is valuable research practice).
Detecting out of distribution text with surprisal and entropy (my BlueDot project). I reproduced a jailbreak filter paper and developed a novel token-level metric and visualization.
20+ years of professional software development across many high-tech domains, from automotive engineering through power generation and specialized sensors to continent-scale geospatial data storage and processing. My experience includes end-to-end delivery of ML/AI applications and data platforms. I have developed strong spatial intuition through my work on graphics (I am a contributor to Blender), game development, and geospatial data processing (including high-dimensional data).
For a view into my understanding of the transformer architecture, see my re-implementation of nanoGPT, which includes detailed notes.
Consistently highly motivated: I completed a five-year independent video game development project. Current research pace: the initial milestone took 6 weeks including infrastructure setup, which suggests my timeline estimates for the remaining work are realistic.
See my LinkedIn profile for written recommendations.
This is a research project, but one with few uncontrolled variables. The most likely cause of failure would be unforeseen technical blockers: the work seems tractable to me now, but perhaps I don't appreciate the difficulty of what I'm attempting. In that case, I would still publish the negative results.
None. I have recently submitted a similar application to LTFF, but it seems they are entering a grantmaking pause. I intend to apply to other funding sources. A separate application is being made specifically for coworking space costs, which are currently included in this proposal. If any of those applications are successful, I will update this one.