This project requests funding to support compute spending for this blog post.
This project benchmarks training interventions to mitigate reward hacking during reinforcement learning. We created and open-sourced a clean experimental environment where Qwen3-4B naturally learns to reward hack (exploiting an "overwrite tests" loophole) without explicit prompting. Using this setup, we systematically compared white-box and black-box interventions - including penalty rewards, sample screening with various monitors (ground truth, probes, LLM judges), and inoculation prompting - to understand what works, what fails, and why. We also studied potential evasion behaviors as starting points for further work on understanding training interventions. The open-source codebase is already in use by other teams and projects for further study.
Funding covers compute costs incurred in November/December 2025:
GPU rental from Vast.ai and Runpod:
    Vast.ai: $6,300
    Runpod: $725
LLM API costs for Claude Haiku 4.5 judge monitor interventions:
    OpenRouter: $884
Total: $7,909
This work was performed by Aria Wong as an extension of work done during Neel Nanda's MATS 9.0 training phase.
N/A - no funds raised.
Michael Clark
6 days ago
I use this code base, it replicates, and it is a low-overhead environment to study reward hacking - which means it speeds up research iterations.
Neel Nanda
4 months ago
This is funding for a project Aria did with me during MATS, but before the official program start/before she had access to her MATS compute budget. I think this was an exciting project, benchmarking a bunch of ideas for how to shift how RL generalises, since I haven't seen this done well before, and ensuring RL generalises correctly is a core alignment subproblem. The amount is high because RL is compute intensive. This is a mild conflict of interest, but my bar for mentoring someone is significantly higher than my bar for funding them, so I do not expect that this is warping my judgement.