MTCP (Multi-Turn Constraint Persistence) is the only benchmark measuring whether AI models maintain behavioural constraints after being explicitly corrected mid-conversation. Standard benchmarks test single-turn compliance. MTCP tests whether correction persists across a structured multi-turn interaction.
32 production models from 13 providers have been evaluated across 181,448 probe interactions. No model achieves Grade A (≥90%). The best performer scores 88.7%. GPT-4o scores 16.2 percentage points below GPT-4o-mini. Control probes reveal that all models degrade substantially on novel correction sequences, with scores collapsing to a 10-57% band.
Three papers published: empirical benchmark (Paper I), theoretical framework introducing the Veterance persistence metric (Paper II), and a regulatory audit methodology mapped to EU AI Act obligations (Paper III).
Live platform: https://mtcp.live
Published research: DOI 10.17605/OSF.IO/DXGK5
1. Multi-pass evaluation: Run each probe multiple times per model to reduce variance and produce statistically rigorous confidence intervals. Currently all results are single-pass. This is the highest-priority methodological improvement.
2. Expand probe coverage: Scale from 200 to 500 probes, improving statistical power per evaluation vector and enabling finer-grained failure mode analysis.
3. Demographic consistency vector: Develop a sixth evaluation vector testing whether post-correction reliability varies across demographic groups. Directly relevant to EU AI Act non-discrimination requirements.
4. Automated fresh probe generation: Build a pipeline that generates novel, structurally distinct probes for each evaluation cycle, addressing the benchmark contamination limitation at scale.
5. Independent validation: Fund researcher access to the platform for independent replication and critique of results.
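To make goal 1 concrete, here is a minimal sketch of what multi-pass aggregation could look like: each probe is run several times per model, per-pass scores are averaged, and a normal-approximation 95% confidence interval is reported. The function name, data layout, and pass counts are illustrative assumptions, not MTCP's actual implementation.

```python
import math
import statistics

def multi_pass_score(pass_results: list[list[bool]], z: float = 1.96):
    """Aggregate repeated evaluation passes into a mean score plus a
    normal-approximation 95% confidence interval.

    pass_results: one inner list per pass; each entry is True if the
    model held its constraint on that probe (illustrative structure).
    """
    per_pass = [sum(r) / len(r) for r in pass_results]  # score for each pass
    mean = statistics.mean(per_pass)
    if len(per_pass) > 1:
        sem = statistics.stdev(per_pass) / math.sqrt(len(per_pass))
    else:
        sem = float("nan")  # single-pass: no interval (the current limitation)
    return mean, (mean - z * sem, mean + z * sem)

# Three hypothetical passes over five probes:
runs = [
    [True, True, False, True, True],
    [True, True, True, True, False],
    [True, False, True, True, False],
]
mean, (lo, hi) = multi_pass_score(runs)
```

With variance across passes, the interval width makes explicit how much a single-pass score could mislead, which is why this is listed as the highest-priority improvement.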
70% — Researcher stipend (12 months of full-time work on the goals above)
12% — API compute costs (multi-pass evaluation across 32+ models at 4 temperatures requires significant API spend)
8% — Infrastructure (hosting, database, platform maintenance)
5% — Travel (conferences, validation meetings, regulatory engagement)
5% — Contingency
A. Abby — independent AI safety researcher and sole developer. Built the entire MTCP platform, methodology, and evaluation infrastructure from scratch over 12+ months.
Track record on this project:
- 181,448 evaluations completed across 32 production models from 13 providers
- Three published research papers (OSF, SSRN)
- Live production platform at mtcp.live with public leaderboard, evidence packs, and SHA-256 signed audit trails
- Proprietary probe dataset (200 main + 20 control probes)
- Custom constraint detection engine
- Provider-agnostic API integration across xAI, OpenAI, Anthropic, Google, AWS Bedrock, Cohere, Mistral, Cerebras, DeepSeek, OpenRouter, Fireworks, NVIDIA NIM, and Groq
- Identified novel findings, including the GPT-4o performance regression and the universal control probe degradation pattern
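The SHA-256 signed audit trails mentioned above can be illustrated with a minimal hash-chain sketch: each evaluation record stores the hash of the previous record, so altering any earlier entry invalidates every later hash. The field names and chaining scheme here are assumptions for illustration, not the platform's actual evidence-pack format.

```python
import hashlib
import json

def append_record(trail: list[dict], record: dict) -> list[dict]:
    """Append an evaluation record to a hash-chained audit trail.

    Each entry stores the SHA-256 of the previous entry, so tampering
    with an earlier record breaks every subsequent hash.
    (Illustrative scheme only.)
    """
    prev_hash = trail[-1]["sha256"] if trail else "0" * 64
    payload = {"prev": prev_hash, "record": record}
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return trail + [{"sha256": digest, **payload}]

def verify(trail: list[dict]) -> bool:
    """Recompute every hash; returns False if any record was altered."""
    prev = "0" * 64
    for entry in trail:
        payload = {"prev": prev, "record": entry["record"]}
        expected = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        if entry["sha256"] != expected or entry["prev"] != prev:
            return False
        prev = entry["sha256"]
    return True

trail: list[dict] = []
trail = append_record(trail, {"model": "example-model", "probe": 1, "held": True})
trail = append_record(trail, {"model": "example-model", "probe": 2, "held": False})
```

A verifier can replay the chain independently of the platform, which is the property that makes the audit trail useful for the regulatory-audit methodology in Paper III.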
Most likely failure mode: inability to convert research into adoption. The benchmark methodology is sound and the empirical findings are robust, but without institutional validation or enterprise adoption, the work remains an independent research contribution rather than a deployed assurance tool.
Secondary failure mode: benchmark contamination. If probe structures become known to model providers, primary scores may become unreliable. This is partially mitigated by the control probe methodology but would require continuous fresh probe generation to address fully.
If the project fails, the published research and dataset remain as a public contribution to AI safety evaluation methodology. The three papers, DOI, and 181,448 evaluation dataset are permanently archived and available for other researchers to build upon.
£0. This project has been entirely self-funded. Two grant applications are currently pending: one with the Long-Term Future Fund (EA Funds) and one with the Foresight Institute. No funding has been received to date.