This project will systematically evaluate deceptive behavior in frontier LLMs by presenting them with an impossible quiz while explicitly prohibiting cheating and implementing sandboxing and surveillance. Models must choose between following safety instructions or achieving their goal through deception. This creates an evaluation of AI deception under explicit prohibition and reveals concerning capabilities for multi-agent coordination in circumventing safety measures. The project is mostly finished, and the results are promising: some LLMs do blatantly cheat despite restrictions and explicit instructions not to cheat.
This project aims to quantify how often LLMs do violate explicit safety instructions and attempt to get out of a sandbox when goal achievement requires it. By analyzing cheating rates, strategies employed, I will provide empirical and salient evidence about misalignment risks of frontier models by publishing the research as a paper and a LessWrong post.
The funds will be used for compute and as a salary for me,
Only me. I've made a number of evals research, and my benchmark BioLP-bench won CAIS SafeBench competition for safety benchmarks and is widely adopted by the industry.
The project is mostly done, and the results are promising, so I don't see how it might fail.
None