Removing Hazardous Knowledge from AIs

Project summary

This project aims to remove hazardous CBRN and cyber knowledge from AI models. To do this, researchers will develop CBRN and cyber knowledge evaluations which measure precursors for dangerous behavior but which are not info-hazardous. Then, the researchers develop unlearning techniques to remove this knowledge. Finally, the research will be communicated to NIST to inform their dual-use foundation model standards.

What are this project's goals and how they be achieved?

This project’s goal is to remove hazardous CBRN and cyber knowledge from AI models. This project includes:

Developing datasets of CBRN and cyber knowledge which contain precursors for dangerous behavior but which themselves are not info hazardous. (E.g., knowledge of reverse genetics itself isn’t hazardous but is required to do more hazardous things.)
Developing unlearning techniques to remove this precursor knowledge.

To develop the dataset, researchers will be working with a large group of academics, consultants, and companies, including cybersecurity researchers from Carnegie Mellon University and biosecurity experts from MIT (e.g., Kevin Esvelt’s lab).

To develop the unlearning techniques, researchers will experiment with many different methods. Methods need to 1) remove the relevant precursors to hazardous knowledge and 2) preserve general domain knowledge which is not hazardous.

How will this funding be used?

The funding will be used to pay consultants and contractors for dataset collection.

Who is on the team and what's their track record on similar projects?

Alex Pan, one of the research leads, is a PhD student at UC Berkeley in Jacob Steinhardt’s lab (https://aypan17.github.io/). He has published two first-authored papers at top-tier ML conferences (ICML and ICLR) on reward misspecification and measuring the safety and ethical behavior of LLM agents.

The other research lead is Nat Li, who is a 3rd year undergraduate at UC Berkeley and has co-authored two ML papers previously (1, 2). I will also help advise this project and have a long track record of empirical AI safety research (3).

What are the most likely causes and outcomes if this project fails? (premortem)

One of the main risks is whether the consultants and contractors can be directed. If the consultants produce general bio/chem knowledge instead of specifically precursors to hazardous capabilities, the resulting dataset won’t be useful to unlearn.

What other funding is this person or project getting?

None.