You're pledging to donate if the project hits its minimum goal and gets approved. If not, your funds will be returned.
LLMs often know when they are being evaluated, and there's evidence that they'll behave differently from how they would in real scenarios if they believe so.
There are various different methods one can use to measure a model's evaluation awareness:
Blackbox: Directly asking it to respond with a binary yes/no, or reporting a distribution, or tracing chain-of-thought for organic evaluation awareness reasoning
Whitebox: Linear probes, SAE features
Ensembles of above methods
However, there is currently no systematic study to compare the reliability of these (and other) approaches. Some current concerns for measurement involve models being bad at reporting probabilities, or models being primed by the questioning itself and having high sensitivity but low specificity (preliminary experiments on Qwen3-32b and R1). Organic chain-of-thought monitoring is more convincing but much rarer, and models might not always talk about being aware.
We want to explore how these different approaches compare, and see which are most informative of models' evaluation awareness abilities. For example, which metrics correlate most with behaviours differences during evaluations (e.g. choosing act deceptively in a setup)?
We plan to use Needham et al. and Laine et al. as starting points, although we plan to also look at setups such as Agentic Misalignment
We will publish a paper and a Lesswrong post with the results. The research will hopefully help the AI safety research community better quantify evaluation awareness capabilities, and help build more trustworthy alignment evaluations.
The money will first go to compute (API costs and vast.ai GPU hours). More money means we will be able to measure and compare more methods and more frontier models (which are most relevant but expensive).
Roughly, our estimates are:
API costs: $4.5-8k
GPU time: $1-2.5k
If the goal is met and there's extra, it will be used as salary for Jord and Igor.
Igor Ivanov, worked on making scheming evaluations more realistic. Previously made BioLP-bench which won CAIS SafeBench competition and is widely adopted.
Jord Nguyen, worked on probing evaluation awareness during Pivotal fellowship. Previously worked on DarkBench while at Apart.
This might be dual use, if used to create better capability benchmarks that frontier companies can hill-climb on.
If evaluation awareness is more conceptually fuzzy in the future (e.g. if models are constantly being monitored and evaluated during deployment), this might be less useful.
None.