Project summary
TL;DR: we, Michelle Lo and Fazl Barez, are planning a follow-up study to our ACL 2024 paper, "Large Language Models Relearn Removed Concepts", which demonstrated that LLMs can quickly regain performance on concepts that were supposedly removed via neuron pruning. This new project investigates whether unlearning works effectively, focusing on unlearning techniques beyond neuron pruning. We will work on this for 3 months of FTE-equivalent time at a rate of approximately $10,000 per month, plus $10,000 for compute and other costs, totaling around $40,000.
What are this project's goals and how will you achieve them?
Our previous work at ACL 2024 showed that LLMs can rapidly relearn concepts that were supposedly removed through neuron pruning. This project aims to extend those findings to other unlearning techniques and test whether unlearning more broadly fails to remove concepts effectively. We will:
Conduct a literature review of unlearning techniques beyond neuron pruning, such as gradient ascent on forget sets, KL divergence minimization, and model editing via control vectors.
Design and run small-scale experiments applying these techniques to remove specific concepts from LLMs (a minimal sketch of one such technique appears after this list).
Track concept reemergence during retraining or fine-tuning, as in our ACL work, adapting the methodology to each unlearning technique (see the probing sketch after this list).
Write up our findings in a paper targeting ICML 2025, assessing whether unlearning can actually be a solution to AI safety concerns.
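To illustrate the kind of experiment we have in mind, here is a minimal sketch of unlearning via gradient ascent on a forget set. The model (GPT-2), the toy forget sentences, and the hyperparameters are placeholders of ours, not choices from the ACL paper; the actual experiments would use the models and concept datasets from our prior work.

```python
# Sketch: unlearning via gradient ascent on a forget set.
# Model, forget set, and hyperparameters are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

forget_texts = [
    "Example sentence expressing the concept we want the model to forget.",
    "Another example sentence about the same target concept.",
]  # hypothetical forget set

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(10):  # a few ascent steps; in practice tuned per technique
    for text in forget_texts:
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])
        # Negate the language-modelling loss so the optimizer *increases*
        # loss on the forget set, i.e. performs gradient ascent.
        loss = -outputs.loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```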
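And a sketch of how we would track concept reemergence: fit a simple linear probe on hidden states and re-evaluate it after each retraining checkpoint, loosely analogous to the concept-saliency analysis in our ACL paper. The probe sentences, layer choice, and use of scikit-learn are illustrative assumptions.

```python
# Sketch: tracking whether a removed concept becomes decodable again.
# Probe data, layer choice, and model are illustrative assumptions only.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def hidden_features(texts, layer=-1):
    """Mean-pooled hidden states from one layer, used as probe features."""
    feats = []
    with torch.no_grad():
        for text in texts:
            batch = tokenizer(text, return_tensors="pt")
            hidden = model(**batch).hidden_states[layer]  # (1, seq, dim)
            feats.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

# Hypothetical probe data: positives express the removed concept, negatives do not.
pos = ["Sentence containing the removed concept.", "Another concept example."]
neg = ["Unrelated sentence about something else.", "Another unrelated sentence."]
X = hidden_features(pos + neg)
y = np.array([1] * len(pos) + [0] * len(neg))

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))
# Re-running this probe after each retraining checkpoint would show whether
# the concept becomes linearly decodable again, i.e. has been relearned.
```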
We hope this project sheds light on both the potential benefits and limitations of unlearning techniques as an approach for enhancing the safety of language models.
How will this funding be used?
The majority will be spent on our salaries (Fazl: 3 months FTE-equivalent; Michelle: part time until September), taxes, and fees, including access to a coworking space when travelling. Some may be used for computational resources if our experiments require more than what is freely available.
Who is on your team and what's your track record on similar projects?
Our track record is strong:
[Relevant Research] Fazl and Michelle are the first and senior authors on "Large Language Models Relearn Removed Concepts" (ACL 2024), which is directly related to this project.
Fazl has previously worked on 25+ papers related to AI safety; for example, he was an author on the Sleeper Agents paper (Hubinger et al., 2024) and the Sycophancy to Subterfuge paper (Denison et al., 2024).
[Publication and leadership] Fazl has co-authored 9 papers in 2024 so far, including one published at ICLR, two published at ICML, and several others under review at NeurIPS and EMNLP 2024; he is also co-organising the Mechanistic Interpretability workshop at ICML 2024. Michelle is currently a Research Engineer intern at G-Research, leading a project on controlling combinations of ML model predictions, and previously interned at Google.
[Time management and mentorship] Fazl has mentored 15+ BSc, MSc, and PhD students in the past year. This summer he will mentor six students (three PhD, two MSc, and one undergraduate), all working on topics related to AI safety. Michelle is an undergraduate teaching assistant at Imperial College London and has mentored first-year computer science students on a weekly basis for the past two years.
What are the most likely causes and outcomes if this project fails? (premortem)
Limited scope: Our ACL work focused on neuron pruning. If other unlearning techniques behave fundamentally differently, our hypothesis might not hold. However, this would still be an interesting result, suggesting that some unlearning methods are more robust than others.
Difficulty in concept tracking: Our methodology for tracking concept reemergence may not translate well to other unlearning techniques. We could mitigate this by adapting our approach for each technique, drawing on our experience and possibly consulting with other experts in the field.
Computational constraints: Large-scale experiments might be prohibitively expensive. In this case, we might focus on smaller models or fewer experiments, which may limit the generalizability of our findings. However, even small-scale results could be compelling.
What other funding are you or your project getting?
None: we have been working on this without funding for the past 4 weeks.