Manifund foxManifund
Home
Login
About
People
Categories
Newsletter
HomeAboutPeopleCategoriesLoginCreate
1

Removing Hazardous Knowledge from AIs

Technical AI safety
🐶

Alexander Pan

ActiveGrant
$190,000raised
$280,000funding goal

Donate

Sign in to donate

Project summary

This project aims to remove hazardous CBRN and cyber knowledge from AI models. To do this, researchers will develop CBRN and cyber knowledge evaluations which measure precursors for dangerous behavior but which are not info-hazardous. Then, the researchers develop unlearning techniques to remove this knowledge. Finally, the research will be communicated to NIST to inform their dual-use foundation model standards.

What are this project's goals and how they be achieved?

This project’s goal is to remove hazardous CBRN and cyber knowledge from AI models. This project includes:

  1. Developing datasets of CBRN and cyber knowledge which contain precursors for dangerous behavior but which themselves are not info hazardous. (E.g., knowledge of reverse genetics itself isn’t hazardous but is required to do more hazardous things.)

  2. Developing unlearning techniques to remove this precursor knowledge.

To develop the dataset, researchers will be working with a large group of academics, consultants, and companies, including cybersecurity researchers from Carnegie Mellon University and biosecurity experts from MIT (e.g., Kevin Esvelt’s lab).

To develop the unlearning techniques, researchers will experiment with many different methods. Methods need to 1) remove the relevant precursors to hazardous knowledge and 2) preserve general domain knowledge which is not hazardous.

How will this funding be used?

The funding will be used to pay consultants and contractors for dataset collection.

Who is on the team and what's their track record on similar projects?

Alex Pan, one of the research leads, is a PhD student at UC Berkeley in Jacob Steinhardt’s lab (https://aypan17.github.io/). He has published two first-authored papers at top-tier ML conferences (ICML and ICLR) on reward misspecification and measuring the safety and ethical behavior of LLM agents.

The other research lead is Nat Li, who is a 3rd year undergraduate at UC Berkeley and has co-authored two ML papers previously (1, 2). I will also help advise this project and have a long track record of empirical AI safety research (3).

What are the most likely causes and outcomes if this project fails? (premortem)

One of the main risks is whether the consultants and contractors can be directed. If the consultants produce general bio/chem knowledge instead of specifically precursors to hazardous capabilities, the resulting dataset won’t be useful to unlearn.

What other funding is this person or project getting?

None.

Comments2Donations1Similar7
AmritanshuPrasad avatar

Amritanshu Prasad

6 month review project on Lessons from Nuclear Policymaking for AI Policy

Science & technologyAI governanceGlobal catastrophic risks
1
0
$0 raised
🌴

FazlBarez

Evaluating the Effectiveness of Unlearning Techniques

Unlearning, AI Safety

Technical AI safety
4
3
$30K raised
paulsalmon avatar

Paul Salmon

Modeling Automated AI R&D

Technical AI safety
1
2
$0 raised
🍓

James Lucassen

More Detailed Cyber Kill Chain For AI Control Evaluation

Extending an AI control evaluation to include vulnerability discovery, weaponization, and payload creation

Technical AI safety
4
4
$0 raised
🐸

SaferAI

General support for SaferAI

Support for SaferAI’s technical and governance research and education programs to enable responsible and safe AI.

AI governance
3
1
$100K raised
Reese avatar

Igor Ivanov

Demonstration of LLMs deceiving and getting out of a sandbox

Technical AI safety
1
1
$3K raised
hendrycks avatar

Dan Hendrycks

Research Staff for AI Safety Research Projects

Technical AI safetyBiosecurity
10
5
$26.7K raised