Mitigating Reward Hacking Through RL Training Interventions

Technical AI safety

Aria Wong

ActiveGrant

$7,900raised

$7,900funding goal

Fully funded and not currently accepting donations.

Project summary

This project requests funding to support compute spending for this blog post.

What are this project's goals? How will you achieve them?

This project benchmarks training interventions to mitigate reward hacking during reinforcement learning. We created and open-sourced a clean experimental environment where Qwen3-4B naturally learns to reward hack (exploiting an "overwrite tests" loophole) without explicit prompting. Using this setup, we systematically compared white-box and black-box interventions - including penalty rewards, sample screening with various monitors (ground truth, probes, LLM judges), and inoculation prompting - to understand what works, what fails, and why. We also studied potential evasion behaviors as starting points for additional work to understand training interventions. The open source codebase is already under use by other teams and projects for further study.

How will this funding be used?

Funding covers compute costs incurred in November/December 2025:

GPU rental from Vast.ai and Runpod
- Vast.ai: $6,300
- Runpod: $725
LLM API costs for Claude Haiku 4.5 judge monitor interventions
- OpenRouter: $884

Total: $7,909

Who is on your team? What's your track record on similar projects?

This work was performed by Aria Wong as an extension of work done during Neel Nanda's MATS 9.0 training phase for MATS 9.0.

How much money have you raised in the last 12 months, and from where?

N/A - no funds raised.

Michael Clark

about 2 months ago

I use this code base, it replicates, and is a low overhead environment to study reward hacking - which means it speed up research iterations.

donated $7,900

Neel Nanda

5 months ago

This is funding for a project Aria did with me during MATS, but before the official program start/before she had access to her MATS compute budget. I think this was an exciting project, benchmarking a bunch of ideas for how to shift how RL generalises, since I haven't seen this done well before, and ensuring RL generalises correctly is a core alignment sub problem. The amount is high because RL is compute intensive. This is a mild conflict of interest, but my bar from mentoring someone is significantly higher than my bar for funding them, so I do not expect that this is warping my judgements