Manifund foxManifund
Home
Login
About
People
Categories
Newsletter
HomeAboutPeopleCategoriesLoginCreate
1

Demonstration of LLMs deceiving and getting out of a sandbox

Technical AI safety
Reese avatar

Igor Ivanov

ActiveGrant
$3,000raised
$4,000funding goal

Donate

Sign in to donate

Project summary

This project will systematically evaluate deceptive behavior in frontier LLMs by presenting them with an impossible quiz while explicitly prohibiting cheating and implementing sandboxing and surveillance. Models must choose between following safety instructions or achieving their goal through deception. This creates an evaluation of AI deception under explicit prohibition and reveals concerning capabilities for multi-agent coordination in circumventing safety measures. The project is mostly finished, and the results are promising: some LLMs do blatantly cheat despite restrictions and explicit instructions not to cheat.

What are this project's goals and how will you achieve them?

This project aims to quantify how often LLMs do violate explicit safety instructions and attempt to get out of a sandbox when goal achievement requires it. By analyzing cheating rates, strategies employed, I will provide empirical and salient evidence about misalignment risks of frontier models by publishing the research as a paper and a LessWrong post.

How will this funding be used?

The funds will be used for compute and as a salary for me,

Who is on your team and what's your track record on similar projects?

Only me. I've made a number of evals research, and my benchmark BioLP-bench won CAIS SafeBench competition for safety benchmarks and is widely adopted by the industry.

What are the most likely causes and outcomes if this project fails? (premortem)

The project is mostly done, and the results are promising, so I don't see how it might fail.

What other funding are you or your project getting?

None

Comments1Donations1Similar7
🍓

James Lucassen

LLM Approximation to Pass@K

Technical AI safety
3
6
$2K / $6K
Bart-Bussmann avatar

Bart Bussmann

Epistemology in Large Language Models

1-year salary for independent research to investigate how LLMs know what they know.

Technical AI safety
5
0
$0 raised
laurence_ai avatar

Laurence Aitchison

Bayesian modelling of LLM capabilities from evals

4
4
$32K raised
agusmartinez92 avatar

Agustín Martinez Suñé

SafePlanBench: evaluating a Guaranteed Safe AI Approach for LLM-based Agents

Seeking funding to develop and evaluate a new benchmark for systematically assesing safety of LLM-based agents

Technical AI safetyAI governance
5
1
$1.98K raised
Artem avatar

Karpov

Steganography via RL

I plan to investigate what realistic RL training conditions might lead to LLMs developing steganographic capabilities.

Science & technologyTechnical AI safetyGlobal catastrophic risks
2
2
$0 raised
🐯

Scott Viteri

Attention-Guided-RL for Human-Like LMs

Compute Funding

Technical AI safety
4
2
$3.1K raised
robertzk avatar

Robert Krzyzanowski

Scaling Training Process Transparency

Compute and infrastructure costs

Technical AI safety
3
4
$5.15K raised