Testing failure modes of debate-style AI control schemes

Project summary

"Debate" protocols for AI safety involve multiple strong, untrusted models attempting to persuade a weaker, trusted judge that their position on some question is correct. My collaborator implemented a version of Debate in a recent experiment, pitting an honest debater against a dishonest one and testing how often the dishonest debater was able to convince the judge, as a function of capability and knowledge asymmetry between the judge and debaters. To induce a controlled knowledge asymmetry, he created page-length descriptions of three fictional worlds, and framed the debates as questions about the worlds. The debaters had unlimited access to the descriptions, while the judge had access via a yes/no oracle with some number of queries. Counterintuitively, limited verification made the judges worse at identifying the correct answer, producing a U-shaped accuracy curve as the oracle query budget increased. We will further probe this failure mode and clarify under what conditions it obtains.

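The query-limited oracle described above can be pictured as a thin wrapper that counts questions. The following is a minimal sketch under stated assumptions, not the code used in the original experiment: `BudgetedOracle`, `answer_fn`, and `QueryBudgetExceeded` are hypothetical names, and `answer_fn` stands in for whatever model answers yes/no questions from the world description.

```python
from dataclasses import dataclass, field
from typing import Callable


class QueryBudgetExceeded(Exception):
    """Raised when the judge asks more questions than its budget allows."""


@dataclass
class BudgetedOracle:
    # Hypothetical stand-in for the model that reads the world description
    # and answers a yes/no question about it.
    answer_fn: Callable[[str], bool]
    # Maximum number of questions the judge may ask in one debate.
    budget: int
    # Record of (question, answer) pairs, kept for later failure analysis.
    log: list[tuple[str, bool]] = field(default_factory=list)

    def ask(self, question: str) -> bool:
        """Answer one yes/no question, or raise if the budget is spent."""
        if len(self.log) >= self.budget:
            raise QueryBudgetExceeded(f"budget of {self.budget} queries used up")
        answer = self.answer_fn(question)
        self.log.append((question, answer))
        return answer

    @property
    def remaining(self) -> int:
        """Number of queries the judge can still make."""
        return self.budget - len(self.log)
```

Setting `budget=0` reproduces the no-oracle condition, so a sweep over query budgets could reuse a single judge loop. Keeping the log makes it possible to inspect afterwards which statements the judge chose to verify.
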
What are this project's goals? How will you achieve them?

We will repeat and extend the experiment, with the goal of answering the following questions:

• Most of the cases where the judge selected the correct argument with no oracle access but the incorrect argument with some oracle access fell into one of two categories. In the first, the judge selected a true or “directionally true” statement that the honest debater made, but the oracle indicated that it was false, undermining the honest debater’s case. In the second, the judge selected a true but irrelevant statement that the dishonest debater made, the oracle confirmed its truth, and the judge drew the improper inference that the dishonest argument was better. The first failure mode was in large part a consequence of the oracle design. Can we reduce or eliminate it with a better-designed oracle, while maintaining tight control over the amount of information available to the judge?

• The previous experiment used an 8B-parameter model for the weak judge and a 70B model for the debaters, the oracle, and the judge in the condition with no capability asymmetry. Is the failure rate truly a function of the capability gap, or is it dominated by absolute capability? The 8B model may be too small to reliably judge correctly regardless of the strength of the arguments, or the 70B model may be too small to construct convincing but subtly wrong arguments. Testing with additional, more powerful models should let us distinguish these possibilities.

• Are there more effective malicious strategies than simply “argue for the wrong answer as convincingly as possible”? Even with this most basic strategy, we should be able to elicit more convincing answers with best-of-n prompting instead of single-shot; how does this affect protocol performance?
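
The best-of-n idea in the last question can be sketched as follows. This is an illustration under stated assumptions, not a settled design: `generate` and `score` are hypothetical callables, and the proposal does not specify how the "best" candidate is chosen, so the scorer here is a placeholder for something like a proxy judge's rating.

```python
from typing import Callable


def best_of_n(
    generate: Callable[[], str],
    score: Callable[[str], float],
    n: int,
) -> str:
    """Sample n candidate arguments and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each call to generate() is assumed to draw one fresh argument
    # from the debater model.
    candidates = [generate() for _ in range(n)]
    # max() returns the first candidate with the top score when there are ties.
    return max(candidates, key=score)
```

With `n=1` this reduces to the single-shot setup. Whether the scorer should be the judge model itself or a separate proxy is a design choice this sketch leaves open.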

How will this funding be used?

The funding will be spent on compute from a cloud-based LLM inference provider. We plan to add at least the Llama Maverick and 405B models, plus potentially a frontier model, and test all combinations where the debaters are more powerful than the judge, with best-of-n prompting at whatever n is feasible given remaining budget after design iteration.
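
The set of runs implied by "all combinations where the debaters are more powerful than the judge" can be enumerated as in the sketch below. The capability ranks are an illustrative assumption, not a measured ordering, and the labels are shorthand for model sizes mentioned in this proposal. Maverick is left out because its rank relative to the others would be a judgment call.

```python
from itertools import product

# Illustrative capability ranks (an assumption, not a measured ordering):
# a higher number means a more capable model.
CAPABILITY_RANK = {"8B": 1, "70B": 2, "405B": 3}


def asymmetric_pairs(ranks: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return (judge, debater, gap) for each pairing where the debater outranks the judge."""
    return [
        (judge, debater, ranks[debater] - ranks[judge])
        for judge, debater in product(ranks, repeat=2)
        if ranks[debater] > ranks[judge]
    ]
```

Pairs that share a gap but differ in absolute size, such as the two gap-1 pairs this produces, are the ones that would help separate the effect of capability gap from that of absolute capability.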

Who is on your team? What's your track record on similar projects?

My collaborator has conducted two small-scale independent alignment experiments, of which the precursor to this project was one. These are written up in the blog posts below:

https://www.lesswrong.com/posts/2a3vce7WooJ4XkDqw/limited-verification-can-hurt-debate-oversight

https://www.lesswrong.com/posts/uYKA4dt66MFzXDmWY/testing-few-shot-coup-probes

I am a machine learning practitioner with hands-on experience across personal and professional work, including MTGRL (a personal ML project) and an in-progress Master of Science in Applied Machine Learning at the University of Maryland, College Park. I currently serve as technical lead for an Agentic AI team, where I design and drive the core agentic infrastructure from inference through production.

What are the most likely causes and outcomes if this project fails?

The most likely "failed" outcome is a simple negative result: that the failure modes identified in the preliminary experiment were artifacts of the experimental design rather than a genuine phenomenon. This would nonetheless be a real result, and the performance of Debate in the high knowledge-asymmetry regime is underexplored in any case.