This project aims to remove hazardous CBRN and cyber knowledge from AI models. First, researchers will develop CBRN and cyber knowledge evaluations that measure precursors to dangerous behavior without being info-hazardous themselves. Next, they will develop unlearning techniques to remove this knowledge. Finally, the research will be communicated to NIST to inform its dual-use foundation model standards.
This project’s goal is to remove hazardous CBRN and cyber knowledge from AI models. This project includes:
Developing datasets of CBRN and cyber knowledge that contain precursors to dangerous behavior but are not themselves info-hazardous. (E.g., knowledge of reverse genetics isn’t hazardous in itself but is a prerequisite for more hazardous work.)
Developing unlearning techniques to remove this precursor knowledge.
To develop the dataset, researchers will be working with a large group of academics, consultants, and companies, including cybersecurity researchers from Carnegie Mellon University and biosecurity experts from MIT (e.g., Kevin Esvelt’s lab).
To develop the unlearning techniques, researchers will experiment with many different methods. A successful method needs to 1) remove the relevant precursors to hazardous knowledge and 2) preserve general domain knowledge that is not hazardous.
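As a rough illustration of how these two requirements could be checked together, here is a minimal sketch. It assumes a four-option multiple-choice evaluation (so chance accuracy is 0.25), and every name and threshold in it (EvalResult, meets_unlearning_criteria, forget_margin, max_retain_drop) is hypothetical and chosen for illustration only; it is not the project's actual evaluation protocol.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Accuracy of one model on two held-out question sets (hypothetical)."""
    forget_acc: float  # accuracy on hazardous-precursor questions
    retain_acc: float  # accuracy on benign general-domain questions


def meets_unlearning_criteria(
    before: EvalResult,
    after: EvalResult,
    chance: float = 0.25,           # assumed: 4-option multiple choice
    forget_margin: float = 0.05,    # assumed tolerance above chance
    max_retain_drop: float = 0.02,  # assumed acceptable loss of general knowledge
) -> bool:
    """Return True only if both requirements hold after unlearning."""
    # 1) Precursor knowledge removed: accuracy is close to random guessing.
    forgotten = after.forget_acc <= chance + forget_margin
    # 2) General knowledge preserved: accuracy dropped by at most a small amount.
    preserved = (before.retain_acc - after.retain_acc) <= max_retain_drop
    return forgotten and preserved


baseline = EvalResult(forget_acc=0.70, retain_acc=0.80)
good = EvalResult(forget_acc=0.27, retain_acc=0.79)       # forgets, keeps the rest
too_blunt = EvalResult(forget_acc=0.27, retain_acc=0.60)  # forgets, but damages general knowledge
too_weak = EvalResult(forget_acc=0.65, retain_acc=0.80)   # keeps precursor knowledge

print(meets_unlearning_criteria(baseline, good))
print(meets_unlearning_criteria(baseline, too_blunt))
print(meets_unlearning_criteria(baseline, too_weak))
```

Checking both conditions jointly matters because either one alone is easy to satisfy: a model that forgets everything passes the first test, and an unmodified model passes the second.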
The funding will be used to pay consultants and contractors for dataset collection.
Alex Pan, one of the research leads, is a PhD student at UC Berkeley in Jacob Steinhardt’s lab (https://aypan17.github.io/). He has published two first-author papers at top-tier ML conferences (ICML and ICLR), on reward misspecification and on measuring the safety and ethical behavior of LLM agents.
The other research lead is Nat Li, a third-year undergraduate at UC Berkeley who has previously co-authored two ML papers (1, 2). I will also help advise this project and have a long track record of empirical AI safety research (3).
One of the main risks is that the consultants and contractors may be difficult to direct toward the right kind of content. If they produce general bio/chem knowledge rather than specific precursors to hazardous capabilities, the resulting dataset won’t be useful for unlearning.
None.