Benchmarking and comparing different evaluation awareness metrics

Project summary

LLMs often know when they are being evaluated, and there's evidence that they'll behave differently from how they would in real scenarios if they believe so.

There are various different methods one can use to measure a model's evaluation awareness:

Blackbox: Directly asking it to respond with a binary yes/no, or reporting a distribution, or tracing chain-of-thought for organic evaluation awareness reasoning
Whitebox: Linear probes, SAE features
Ensembles of above methods

However, there is currently no systematic study to compare the reliability of these (and other) approaches. Some current concerns for measurement involve models being bad at reporting probabilities, or models being primed by the questioning itself and having high sensitivity but low specificity (preliminary experiments on Qwen3-32b and R1). Organic chain-of-thought monitoring is more convincing but much rarer, and models might not always talk about being aware.

We want to explore how these different approaches compare, and see which are most informative of models' evaluation awareness abilities. For example, which metrics correlate most with behaviours differences during evaluations (e.g. choosing act deceptively in a setup)?

We plan to use Needham et al. and Laine et al. as starting points, although we plan to also look at setups such as Agentic Misalignment

What are this project's goals? How will you achieve them?

We will publish a paper and a Lesswrong post with the results. The research will hopefully help the AI safety research community better quantify evaluation awareness capabilities, and help build more trustworthy alignment evaluations.

How will this funding be used?

The money will first go to compute (API costs and vast.ai GPU hours). More money means we will be able to measure and compare more methods and more frontier models (which are most relevant but expensive).

Roughly, our estimates are:

API costs: $4.5-8k
GPU time: $1-2.5k

If the goal is met and there's extra, it will be used as salary for Jord and Igor.

Who is on your team? What's your track record on similar projects?

Igor Ivanov, worked on making scheming evaluations more realistic. Previously made BioLP-bench which won CAIS SafeBench competition and is widely adopted.

Jord Nguyen, worked on probing evaluation awareness during Pivotal fellowship. Previously worked on DarkBench while at Apart.

What are the most likely causes and outcomes if this project fails?

This might be dual use, if used to create better capability benchmarks that frontier companies can hill-climb on.
If evaluation awareness is more conceptually fuzzy in the future (e.g. if models are constantly being monitored and evaluated during deployment), this might be less useful.

How much money have you raised in the last 12 months, and from where?

None.