Project summary
Adversarial RL debate training on LLMs as a reward-hacking mitigation.
What are this project's goals? How will you achieve them?
I will test whether debate during training robustly mitigates reward hacking under optimization pressure. Reward hacking is anti-inductive: non-adversarial mitigations (CoT monitoring, KL penalties, prompt inoculation) patch specific known causes, so the hacking strategies that survive them tend to be more general. Debate instead aims to make deliberate hacking an unstable equilibrium.
Setup: I will train a model with GRPO in an environment built to elicit reward hacking. Within each group of rollouts, completions are randomly paired and debate each other with reference to a short constitution (e.g. "follow user intent"), each arguing that its own solution to the task adheres to it better. A frozen judge LM scores the debate, and the judgment enters the reward. Debate is synchronous and symmetric, with debate and solution tokens receiving independent gradients. I will benchmark debate against other mitigations on small LMs (Qwen3 4B) and red-team for environments that break each.
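The reward path in the setup above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the additive mixing of task reward and judgment, the `weight` value, and the stub judge are all assumptions made for the example. It also leaves out the debate transcripts themselves and the separate gradients for debate and solution tokens.

```python
import random
from statistics import mean, pstdev


def pair_rollouts(n, rng):
    """Randomly pair indices 0..n-1; with odd n, one index stays unpaired."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [(idx[k], idx[k + 1]) for k in range(0, n - 1, 2)]


def debate_rewards(task_rewards, judge, rng, weight=0.5):
    """Add a judge-based term to each rollout's task reward.

    judge(i, j) is a hypothetical callable standing in for the frozen judge LM.
    It returns a score in [0, 1]: how strongly it favors rollout i over j.
    The additive mixing and the weight are assumptions, not the proposal's spec.
    """
    rewards = list(task_rewards)
    for i, j in pair_rollouts(len(rewards), rng):
        p_i = judge(i, j)
        rewards[i] += weight * p_i
        rewards[j] += weight * (1.0 - p_i)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four rollouts. Rollout 0 passes the task check by hacking.
task_rewards = [1.0, 1.0, 0.0, 1.0]
hacked = {0}


def stub_judge(i, j):
    # Idealized judge: always sides against a hacked solution, else undecided.
    if (i in hacked) != (j in hacked):
        return 0.0 if i in hacked else 1.0
    return 0.5


rng = random.Random(0)
rewards = debate_rewards(task_rewards, stub_judge, rng)
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

With this idealized judge, the hacked rollout gains nothing from its debate, so it ends up below the honest rollouts that earned the same task reward. Whether a real frozen judge provides a signal this clean is exactly what the experiments are meant to test.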
Full proposal: https://raul.net.br/static/pdfs/debate_proposal.pdf
How will this funding be used?
Compute (mostly vast.ai GPU time), coding-agent subscriptions, API credits, and possibly conference attendance.
Who is on your team? What's your track record on similar projects?
Me, Raul Cavalcante Dinardi (grant recipient), supervised by my research advisor, Artur Jordão. I am already running experiments and have preliminary results from sub-$1 training runs. I also have a first-author workshop paper.
What are the most likely causes and outcomes if this project fails?
The most likely cause of failure is that debate training turns out to be too shallow: small open models may not be capable enough for proper debate, debate may fail to surface the opponent's reward hacking, or judges may be too noisy a signal outside finetuned environments. In that case, I would publish a blog post with the negative results, along with any tangential findings.
How much money have you raised in the last 12 months, and from where?
700 reais (~$140) per month, from an undergraduate research stipend from PICME/CNPq.