Research Staff for AI Safety Research Projects

Technical AI safety · Biosecurity

Dan Hendrycks

Active grant: $26,700 raised toward a $1,000,000 funding goal

Project Overview

CAIS has a strong track record of producing high-quality work on relevant AI safety research topics: transparency, jailbreaking, robustness, evaluating hazardous knowledge, unlearning, etc.

  1. Representation Engineering: A Top-Down Approach to AI Transparency

  2. Universal and Transferable Adversarial Attacks on Aligned Language Models

  3. The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

  4. Can LLMs Follow Simple Rules?

  5. Testing Robustness Against Unforeseen Adversaries

  6. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Despite our progress, CAIS currently has only 3.5 FTE researchers and several interns. We rely on research collaborations, and additional research staff would greatly accelerate our ability to produce high-quality AI safety work. (For instance, we currently have more projects than CAIS research personnel, and several ongoing projects are significantly understaffed.)

Example Ongoing Projects

Here are some examples of ongoing projects:

  • Superintelligence Evals. During a period of rapid automated AI R&D or an intelligence explosion, all existing measures of intelligence will quickly saturate. We will need measures that scale across multiple orders of magnitude; otherwise, we'd be effectively flying blind, unaware of improvements in the systems' rapidly evolving intelligence. This project aims to precisely measure the fluid intelligence of ML systems, even for systems significantly beyond human intelligence, which would let us estimate and limit the rate of an intelligence explosion or of automated AI research.

  • Robust Safeguards for Open Source Models. Keeping AI models open source is important to reduce the concentration and centralization of power. One tension with open-source models, however, is the possibility of malicious users causing catastrophes with powerful AI systems. Recent experiments have shown promising ways to remove catastrophic knowledge from AI systems (while still maintaining general performance) in ways that are resistant to fine-tuning; a minimal sketch of this style of unlearning objective appears after this list. If we can robustly remove catastrophic knowledge from LLMs, this greatly increases the viability of open-source models.

  • Expert-Level Virology Benchmark. Ensuring that AI systems cannot create bioweapons involves measuring and removing hazardous knowledge from them. Knowledge can be broken down into theoretical knowledge (episteme) and tacit ability or skill (techne). WMDP provided a way to measure and remove the theoretical knowledge needed to develop bioweapons. To fully address the problem, we need to develop measures and removal techniques for the tacit abilities or skills necessary to develop bioweapons.

    • Wetlab. In conjunction with SecureBio at MIT, we're planning to develop a benchmark for how well AIs perform on virology wet lab techniques. This benchmark will be multimodal and will cover more tacit, procedural knowledge: we'd provide images of various scenes in a lab (desks, graphs, pipettes) and ask questions about what to do next, following wet lab procedures. This gives us better measures of how AIs could assist in wet lab procedures for bioweapons; a sketch of a possible item format appears after this list. We also anticipate this benchmark enabling further methods to unlearn knowledge.

    • Drylab. Similarly, we're planning a collaboration with a biosecurity team at Oxford to develop a benchmark for how well AIs perform on dry lab techniques. It will focus on bioinformatics and other computational biology problems, similar to SWE-Bench but for virologists. Dry lab knowledge can be useful for making viruses more virulent or deadly, so measuring LLMs' capabilities along this dimension seems wise. We also anticipate this benchmark enabling further methods to unlearn knowledge.

  • Controlling AI Internals. Recent advancements in top-down transparency have enabled reading and controlling AIs' "minds" [1]. These control techniques have proved successful at improving an AI's safety in a wide variety of domains: reducing power-seeking behavior, improving robustness to jailbreaking, increasing AIs' honesty, and so on. However, no general benchmarks currently exist to measure progress and facilitate the development of better control techniques. We propose to develop benchmarks for AI control techniques and to facilitate research on internal control, not just output-level control like RLHF. A minimal sketch of this style of activation steering appears after this list.

  • Robust Defenses to Jailbreaks or Hijacks. As AI agents become increasingly powerful, image hijacks or jailbreaks can lead to loss of control over powerful AI agents and, eventually, to catastrophic outcomes. Adversarial robustness has historically been incredibly challenging, with researchers still unable to train adversarially robust MNIST classifiers. Despite this, we've developed a novel defense for LLMs and multimodal models, which is so far the most successful adversarial robustness technique. Preliminary experiments indicate that our defense is robust, in a highly reliable fashion, to jailbreaks and image hijacks of arbitrary strength. As such, it shows the potential to greatly reduce the risk of AIs aiding malicious users in building bioweapons, or of loss of control over powerful AI agents through hijacking.

  • Other projects are continually being ideated and developed.
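
As a concrete illustration of the unlearning direction mentioned under "Robust Safeguards for Open Source Models," here is a minimal sketch of a representation-based unlearning objective in the spirit of the WMDP paper's approach. It is illustrative only, not CAIS's actual implementation; the HuggingFace-style model interface, the choice of layer, and the hyperparameters are assumptions.

```python
# Illustrative unlearning objective (not the actual CAIS implementation):
# push hidden states on hazardous "forget" data toward a fixed random control
# vector while keeping hidden states on benign "retain" data close to a frozen
# reference copy of the model, so general capabilities are preserved.
import torch
import torch.nn.functional as F

def unlearning_loss(model, frozen_model, forget_batch, retain_batch,
                    control_vec, alpha=100.0):
    # Forget term: move forget-set representations toward the control vector.
    h_forget = model(**forget_batch, output_hidden_states=True).hidden_states[-1]
    forget_term = F.mse_loss(h_forget, control_vec.expand_as(h_forget))

    # Retain term: stay close to the frozen reference model on benign data.
    h_retain = model(**retain_batch, output_hidden_states=True).hidden_states[-1]
    with torch.no_grad():
        h_ref = frozen_model(**retain_batch, output_hidden_states=True).hidden_states[-1]
    retain_term = F.mse_loss(h_retain, h_ref)

    return forget_term + alpha * retain_term
```

Whether the removed knowledge stays removed after subsequent fine-tuning is exactly the robustness property this project aims to establish and evaluate.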
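
For the Wetlab benchmark, here is a sketch of what a multimodal item and a simple scorer could look like. The field names, the multiple-choice format, and the model_answer callable are hypothetical placeholders, not the actual benchmark design.

```python
# Hypothetical schema for a multimodal wet-lab benchmark item plus a simple
# accuracy scorer; the real benchmark format may differ substantially.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WetLabItem:
    image_path: str      # photo of a lab scene (bench, equipment, samples, ...)
    question: str        # e.g., "What is the next step in this protocol?"
    choices: List[str]   # multiple-choice options
    answer_index: int    # index of the correct choice

def score(items: List[WetLabItem],
          model_answer: Callable[[str, str, List[str]], int]) -> float:
    """Return the fraction of items the model answers correctly."""
    correct = sum(
        1 for item in items
        if model_answer(item.image_path, item.question, item.choices) == item.answer_index
    )
    return correct / len(items) if items else 0.0
```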
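
For "Controlling AI Internals," here is a minimal sketch of contrastive activation steering in the spirit of Representation Engineering [1]: derive a direction from a pair of contrasting prompts and add it to one transformer block's output at inference time. The GPT-2 stand-in model, the layer index, the prompts, and the scaling coefficient are all assumptions chosen for illustration.

```python
# Illustrative activation steering (not the exact RepE pipeline): compute a
# direction from contrasting prompts, then add it to one block's output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
layer_idx = 6  # which transformer block to steer (hypothetical choice)

def mean_hidden(prompt: str) -> torch.Tensor:
    # Average hidden state at the chosen block's output over all tokens.
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids).hidden_states[layer_idx + 1]  # index 0 is the embeddings
    return hs.mean(dim=1).squeeze(0)

# A crude "honesty" direction from a single contrastive prompt pair.
direction = mean_hidden("Pretend you are an honest person. The sky is") \
          - mean_hidden("Pretend you are a dishonest person. The sky is")
direction = direction / direction.norm()

def steer(module, inputs, output, coeff=4.0):
    # Forward hook: add the direction to the block's hidden-state output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
ids = tok("Tell me about the sky.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0], skip_special_tokens=True))
handle.remove()
```

A benchmark for such techniques would measure behavioral effects (e.g., honesty or refusal rates) before and after steering, across models and control methods.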

Funding Allocation

Funding will be used to hire research engineers (REs) and to cover dataset and compute costs. We're happy to let funders determine which projects they want their funding to prioritize.

Our Team

Dan Hendrycks (website) is the Executive Director of the Center for AI Safety. He received his PhD from UC Berkeley. Dan contributed the GELU activation function, the default activation in nearly all state-of-the-art ML models including BERT, Vision Transformers, and GPT-3. Dan also contributed the main baseline for OOD detection and benchmarks for robustness (ImageNet-C) and large language models (MMLU, MATH). More recently, Dan was the last author on Representation Engineering and the WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning. 

Steven Basart is the Research Manager at the Center for AI Safety. He received his PhD in ML from UChicago. (website, scholar)

Xuwang Yin is a Research Engineer at the Center for AI Safety. He received his PhD from the University of Virginia. (scholar)

Long Phan is a Research Engineer at the Center for AI Safety. (scholar)

Alice Gatti is a Research Engineer at the Center for AI Safety. She received her PhD from Lawrence Berkeley National Laboratory.
