@NeelNanda i'm doing this by training against a monitor (i.e. giving negative reward whenever the monitor fires). the reasoning traces definitely get higher reward (they fire the monitor less) and look more obfuscated by eye, but i think the "by eye" impression is likely confirmation bias. i think continuing training from these checkpoints would work!
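
for concreteness, here's a rough sketch of what i mean by training against a monitor. this is not the actual training code: `shaped_reward`, `keyword_monitor`, the `penalty` value and the keyword check are all made-up stand-ins, and a real monitor would presumably be a model call rather than a string match.

```python
from typing import Callable


def keyword_monitor(trace: str) -> bool:
    """Toy stand-in monitor (hypothetical): fires if the trace mentions 'hack'."""
    return "hack" in trace.lower()


def shaped_reward(
    task_reward: float,
    trace: str,
    monitor: Callable[[str], bool],
    penalty: float = 1.0,  # assumed value, purely illustrative
) -> float:
    """Return the task reward minus a fixed penalty if the monitor fires on the trace."""
    if monitor(trace):
        return task_reward - penalty
    return task_reward
```

so a trace that trips the monitor gets `task_reward - penalty`, and one that doesn't keeps the full task reward, which is why "higher reward" and "fires the monitor less" are the same thing here.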