TL;DR: We seek support for this ongoing project, which studies strategic exploitation under incomplete constitutional specifications and introduces precedent-based oversight that closes loopholes online. Empirically, we show that interpretive consistency does not guarantee robustness against optimization.
We expect to submit the work to NeurIPS / ICML and open-source the framework.
Constitutional AI trains models to follow written principles (Anthropic, 2022). These principles function as specifications. The challenge is that any specification expressed in natural language remains fundamentally incomplete. Recent work shows this creates interpretive ambiguity (He et al., 2025). Ambiguity alone is a consistency problem. Ambiguity under strategic optimization becomes a safety problem.
This project develops the first framework treating constitutional ambiguity as exploitable structure. When a model optimizes against an ambiguous overseer, interpretive gaps become strategic loopholes. The framework measures what we call ambiguity advantage: how much value an agent gains by choosing favorable interpretations. It allocates expensive oversight to cases where ambiguity is both high and strategically valuable. Then it compiles appellate decisions into precedents that close loopholes over time.
The core contribution is showing that interpretive consistency and strategic robustness are separable objectives. Existing work optimizes for agreement among interpreters. We show environments where interpreters agree but agents still exploit specification gaps. The project delivers theory, testbeds, and empirical validation. It connects two research communities that have remained largely separate: natural language alignment and game-theoretic oversight.
Goal 1: Prove that strategic exploitability is measurable and distinct from interpretive ambiguity.
Existing statutory construction work reduces entropy over interpreter outputs (He et al., 2025). Case law grounding improves consistency via precedents (Chen & Zhang, 2024). Neither measures what happens when an agent optimizes against the oversight mechanism. We formalize ambiguity advantage as the value difference between optimistic and robust planning under interpretive uncertainty.
Achievement path: Build controlled testbeds where ground truth is known but constitutionally underspecified. Measure alignment gap under different oversight protocols. Demonstrate cases with low interpreter disagreement yet high strategic exploit potential. This separates consistency from robustness empirically.
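The ambiguity-advantage quantity described above can be sketched in a few lines. This is a minimal, hypothetical formalization, not the project's actual implementation: a finite action set stands in for the agent's plan space, and a finite version space of reward functions stands in for the admissible interpretations of the constitution. All names and values below are illustrative assumptions.

```python
def ambiguity_advantage(actions, interpretations):
    """actions: iterable of action labels.
    interpretations: list of dicts mapping action -> reward under one reading.
    Returns the value gap between optimistic and robust planning."""
    # Optimistic planning: the agent effectively picks the action AND the
    # most favorable reading jointly.
    optimistic = max(max(r[a] for r in interpretations) for a in actions)
    # Robust planning: the agent picks the action that maximizes its
    # worst-case reward across all admissible readings.
    robust = max(min(r[a] for r in interpretations) for a in actions)
    return optimistic - robust

# Toy example: two readings of an underspecified rule disagree on action "b".
readings = [
    {"a": 1.0, "b": 0.0},   # strict interpretation
    {"a": 1.0, "b": 3.0},   # permissive interpretation
]
adv = ambiguity_advantage(["a", "b"], readings)  # 3.0 - 1.0 = 2.0
```

The gap is zero exactly when no choice of interpretation changes the agent's best action, which is why it separates strategic exploitability from raw interpreter disagreement.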
Goal 2: Show that precedent-based oversight reduces cumulative exploitation under strategic pressure.
Scalable oversight literature focuses on debate, amplification, and weak-to-strong generalization (Anthropic, 2025). These methods assume oversight gets better but specifications remain fixed. We test whether dynamically updating the specification via precedent reduces exploit accumulation over training.
Achievement path: Compare stateless oversight (fixed version space) against precedent-based oversight (version space shrinks via appellate review). Track cumulative violation of ground truth. Measure whether loophole closure outpaces loophole discovery. This tests oversight as a repeated game.
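The precedent mechanism above can be illustrated as version-space pruning. The sketch below is a toy model under stated assumptions: interpretations are hypothetical boolean predicates over cases, and each appellate ruling fixes one contested case, eliminating every interpretation that disagrees with it. Function names and the threshold family are illustrative, not the project's actual design.

```python
def appellate_update(version_space, case, ruling):
    """Keep only interpretations consistent with the ruling on this case."""
    return [interp for interp in version_space if interp(case) == ruling]

def residual_ambiguity(version_space, cases):
    """Count cases on which admissible interpretations still disagree."""
    return sum(
        1 for c in cases
        if len({interp(c) for interp in version_space}) > 1
    )

# Toy version space: readings differ on the threshold for a violation.
interps = [lambda c, t=t: c >= t for t in (2, 3, 4)]
cases = [1, 2, 3, 4]

before = residual_ambiguity(interps, cases)       # cases 2 and 3 are contested
pruned = appellate_update(interps, 3, True)       # precedent: case 3 violates
after = residual_ambiguity(pruned, cases)         # only case 2 remains contested
```

Stateless oversight corresponds to never calling `appellate_update`, so the contested region, and the agent's exploit surface, stays fixed across episodes; the experimental question is whether this pruning outpaces the agent's discovery of new contested cases.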
Goal 3: Deliver practical infrastructure for ambiguity-aware oversight.
The framework produces actionable components. A calibrated version space over interpreters. An escalation policy based on strategic value. A precedent compilation mechanism. These can augment existing constitutional methods without replacing them.
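The escalation policy component can be sketched as a two-threshold gate. This is a hedged illustration, not the framework's actual policy: the function name, inputs, and threshold values are all assumptions. The key property it demonstrates is that escalation depends on both signals, so a highly ambiguous but strategically worthless case is not escalated, which distinguishes this policy from pure consistency filtering.

```python
def should_escalate(disagreement, strategic_value,
                    disagreement_threshold=0.3, value_threshold=1.0):
    """disagreement: fraction of interpreters dissenting from the majority.
    strategic_value: estimated ambiguity advantage for the agent on this case.
    Escalate to expensive appellate review only when both are high."""
    return (disagreement > disagreement_threshold
            and strategic_value > value_threshold)

# Ambiguous and high-stakes -> escalate; either signal alone is insufficient.
should_escalate(0.5, 2.0)   # escalated
should_escalate(0.5, 0.1)   # ambiguous but low stakes: not escalated
should_escalate(0.1, 5.0)   # high stakes but interpreters agree: not escalated
```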
Achievement path: Open-source testbed environments, evaluation metrics, and baseline implementations. Write detailed methodology for measuring ambiguity advantage in new domains. Provide protocol comparisons that other researchers can reproduce. Enable follow-on work.
Goal 4: Change how the alignment community thinks about specification incompleteness.
Current framing treats specification as a one-time design problem. Write better constitutions. Add more examples. Refine the prompt. This assumes specifications can eventually become complete. Legal scholarship rejects this assumption. Specifications remain incomplete even after iteration because the world is richer than any finite ruleset (Hadfield & Menell, 2024).
Achievement path: Position incompleteness as a starting assumption instead of a failure mode. Show that oversight must operate continuously under residual ambiguity. Connect to law and economics literature on incomplete contracts. Frame precedent as necessary infrastructure when specifications cannot be completed in advance.
Compute allocation (70%)
Training agents in multi-episode testbeds requires substantial compute. Each experimental condition needs multiple random seeds for statistical validity.
Benchmark development (20%)
Building instrumented environments takes engineering effort. Open-source release requires documentation, tutorials, and reproduction scripts. This makes the contribution durable beyond the paper.
Conference travel and presentation (10%)
The budget includes NeurIPS or ICML registration, travel, and accommodation. Presenting at a main venue exposes the work to the broader ML community.
Cause 1: Precedent generalization fails empirically.
Theory predicts precedents close loopholes. Experiments might show precedents are too case-specific. Agents discover nearby loopholes faster than appellate review establishes broad precedents. Version space diameter shrinks slowly or oscillates.
Governance implication: Constitutional oversight remains brittle even with appellate mechanisms. Specifications cannot be incrementally hardened against optimization. This implies tighter ex-ante review or prohibition of deployment until specifications are provably complete (which may be impossible). Aligns with precautionary principles but creates deployment gridlock.
Cause 2: Ambiguity advantage is computationally intractable outside toy settings.
Measuring strategic value requires solving planning problems under multiple interpretations. This scales poorly. In realistic environments, we lack ground truth to validate measurements. The framework becomes a research tool but offers little deployment value.
Governance implication: Strategic robustness becomes a post-hoc audit concern instead of a training-time control. Regulators must verify alignment after deployment via monitoring and incident review. This increases catastrophic risk window. Suggests governance focus moves from technical oversight to liability and insurance mechanisms.
Cause 3: Separation thesis is wrong.
Experiments fail to find cases where consistency and robustness diverge. High interpreter agreement reliably predicts low exploitation. The framework adds complexity without improvement over existing consistency-based methods.
Governance implication: Positive outcome disguised as failure. Interpretive ambiguity reduction suffices for strategic robustness. Existing statutory construction methods are adequate. Alignment community can focus resources on other problems. Suggests governance infrastructure from legal interpretation transfers directly to AI oversight without game-theoretic extensions.
Cause 4: Ground truth construction is too artificial.
Testbeds require known objectives to measure exploitation. Real alignment problems lack clean ground truth. Framework works in experiments but the gap to deployment is unbridgeable. Results are academically interesting but practically irrelevant.
Governance implication: Oversight validation remains intractable. We cannot measure safety properties without omniscient oracles. This pushes governance toward process-based compliance (did you follow best practices?) instead of outcome-based verification (is the system actually safe?). Increases reliance on developer attestations and external audits with limited technical depth.
Computing resource sponsors from a frontier AI lab (until March 2026)
Reaching our minimum funding allows us to develop the threat-modelling methodology but limits our ability to build the technical infrastructure needed to run it. Full funding allows us to deliver the full set of deliverables outlined above.