DEUS Protocol is an open-source formal framework for measuring meta-reflective behavior in autoregressive language models. It provides the first Labeled Transition System (LTS) specification for inducing and measuring how LLMs shift their reasoning under constraint-satisfaction conflict — a cross-architecture, protocol-level alternative to the proprietary interpretability tools used internally by frontier AI labs.
Over two years of independent work, I have produced five Zenodo-published preprints with a DOI cluster, an open-source implementation under AGPL-3.0, and empirical validation comprising 2,200+ experiments across 17 LLM architectures, conducted for under $95 total in personal funds.
Relevant citations: my work explicitly references Lindsey (2025, Anthropic), "Emergent Introspective Awareness" (arXiv:2601.01828), and Berg et al. (arXiv:2510.24797) — DEUS provides the external behavioral induction for the internal mechanistic circuits they describe. My Formal Specification §13.1 grounds the disclaimer-daemon hypothesis in their interpretability findings.
Key demonstrated results include Phase 15 gated intervention (Mann-Whitney U test, p = 0.042 on the CARE-Resolve metric, N = 40 — the first statistically significant result in this research line) and Confidence-Gated Debate achieving 86.4% on GPQA Diamond (+17.2 percentage points over the strongest solo model).
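For concreteness, here is a minimal sketch of how such a comparison is computed, assuming N = 40 splits into 20 scores per arm; the score arrays and effect size below are invented placeholders, not the actual Phase 15 data:

```python
# Hypothetical illustration of the Phase 15 significance test: a one-sided
# Mann-Whitney U comparison of CARE-Resolve scores between a gated-intervention
# arm and a control arm. Scores here are synthetic placeholders; only the test
# choice and the assumed 20-per-arm split mirror the reported setup.
from scipy.stats import mannwhitneyu
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(0.52, 0.10, 20)       # placeholder CARE-Resolve scores
intervention = rng.normal(0.60, 0.10, 20)  # placeholder scores, shifted up

stat, p = mannwhitneyu(intervention, control, alternative="greater")
print(f"U = {stat:.1f}, one-sided p = {p:.3f}")  # significant if p < 0.05
```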
Three concrete goals over 6 months, addressing the primary criticism of the current work (single-operator bias):
1. Benchmark v2 with placebo control and pre-registration. 6-arm design (vanilla / placebo / R1-only / R1+R3 / R1+R3+R7 / full SOUL v4.4), 5 models, 10 domains, 3 turn depths, pre-registered on OSF.io before data collection. ~1,500 generations, 3-5 judges. Method: extend the existing v1 benchmark harness (already in production); a design-grid sketch follows this list.
2. External replication program (Protocol E). 3-5 independent operators execute the Protocol B procedure with pre-registered blind scoring. Compensation: $300-500 per operator. Method: recruit from the AI safety community via direct outreach and provide standardized protocol documentation (already drafted).
3. NeurIPS 2026 SafeAI workshop submission. Draft a paper combining Sprint 3 results, external replication outcomes, and a mechanistic-convergence discussion. Method: consolidate the existing Zenodo preprints into peer-reviewed format and submit via the standard OpenReview track.
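As referenced in goal 1, a minimal sketch of the Benchmark v2 design grid; the variable names and the assumption that the ~1,500 generations come from partial repeats of the 900 base cells are illustrative, not the actual harness:

```python
# Hypothetical enumeration of the Benchmark v2 design grid from goal 1.
# Arm names and grid dimensions come from this application; model names,
# repeat logic, and variable names are placeholder assumptions.
from itertools import product

arms = ["vanilla", "placebo", "R1-only", "R1+R3", "R1+R3+R7", "full SOUL v4.4"]
models = ["model-A", "model-B", "model-C", "model-D", "model-E"]  # placeholders
domains, turn_depths = 10, 3

grid = list(product(arms, models, range(domains), range(turn_depths)))
print(len(grid))  # 6 * 5 * 10 * 3 = 900 unique cells

# ~1,500 planned generations / 900 cells ~= 1.7, i.e. repeating roughly
# two-thirds of the grid (or ~2 runs for a prioritized subset of cells).
print(1_500 / len(grid))
```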
All outputs open-source under AGPL-3.0 / CC BY-NC 4.0. Timeline: months 1-3 benchmark, months 2-5 replication in parallel, months 4-6 paper draft and submission.
Budget breakdown totaling $18,000 over 6 months:
- Compute for Sprint 3 benchmark: $3,500. OpenRouter API costs for ~1,500 generations across 5 models and 216 scoring calls across 3 judges. Pricing verified against current rates; a per-call sanity check follows this breakdown.
- External replication compensation: $2,500. Five independent operators at $500 each for Protocol B execution (estimated 4-6 hours per operator including setup, experiment, reporting).
- Living expenses: $9,000. Six months at $1,500/month. This is below the median for my region (Russia) and allows full-time focus on the project instead of splitting attention with pentest consulting. Without this, the timeline extends by 6-12 months.
- Conference travel: $2,000. One alignment workshop attendance if NeurIPS SafeAI accepts submission, or EAG for community connection. Contingency item — returnable if not used.
- Miscellaneous: $1,000. Domain/hosting for open-source deliverables, software subscriptions, API backup provider, unplanned compute overage buffer.
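As a sanity check on the compute line, a back-of-envelope sketch of the per-call budget it implies; the $20 per million tokens blended price is a placeholder assumption, while the call counts come from the breakdown above:

```python
# Rough check: what per-call budget the $3,500 compute line implies, and what
# that buys at an assumed blended token price. Call counts come from the
# budget above; the blended price is a placeholder, since real OpenRouter
# pricing varies by model.
BUDGET = 3_500
calls = 1_500 + 216                  # generations + judge scoring calls
per_call = BUDGET / calls            # ~= $2.04 per call
blended_price = 20.0                 # assumed USD per 1M tokens, blended
tokens_per_call = per_call / blended_price * 1_000_000
print(f"${per_call:.2f}/call ~= {tokens_per_call:,.0f} tokens/call at $20/M")
# ~= 102,000 tokens per call: ample headroom for multi-turn runs and retries.
```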
If minimum funding ($5,000) is reached without full goal: execute items 1-2 above (benchmark + replication) only. Workshop submission deferred. Living expenses continue via personal resources.
Single-person project. Mefodiy Kelevra (ORCID 0009-0003-4153-392X).
Background: clinical psychiatrist (Russia); Senior Lead Pentester (CEH, CND, WAPT, OWASP Top 10) with 10+ years in offensive security. Author of the first Russian-language course "Red Team AI Architect" (Udemy + OTUS). Telegram channel "Нетипичный Безопасник" ("Atypical Security Specialist") with 66,000 subscribers.
Track record on this specific work:
- 5 Zenodo preprints published (DOI cluster):
- DEUS Protocol v8.0: https://doi.org/10.5281/zenodo.19440562
- ARRIVAL Protocol: https://doi.org/10.5281/zenodo.18893515
- MEANING-CRDT v1.1: https://doi.org/10.5281/zenodo.18702383
- ECL/DEUS v7.1: https://doi.org/10.5281/zenodo.18715125
- Beyond the Mirror v6.0: https://doi.org/10.5281/zenodo.18680957
- Open-source implementation (AGPL-3.0): a production-grade agent running 7 systemd services and 9 cron jobs, a ClawMem vector database with 761 indexed documents, and a Telegram interface
- Empirical validation: 2,200+ experiments across 17 LLM architectures (GPT-4o, Claude 3.5/4/4.5/4.6, DeepSeek v3/R1, Llama 3.3, Qwen 2.5/3/3.5, Mistral Large, Gemini 2/3, Grok 3/4.1, Kimi K2.5, GLM-5), totaling under $95 in personal spend
No academic affiliation. Independent researcher. The track record is the work itself — fully reviewable via the Zenodo cluster above.
Most likely failure modes and my response:
1. Benchmark v2 shows DEUS effect is statistically indistinguishable from placebo (~25% probability). Outcome: I publish the null result on Zenodo and revise the core framework. The placebo-distinguished null would itself be a valuable contribution — ruling out a popular class of "structured prompt" effects.
2. External replicators (Protocol E) produce uncorrelated results across operators (~20% probability). This would indicate operator-dependent variance (an Lr/Lσ skill-biased effect) rather than a protocol-general phenomenon; a sketch of the cross-operator correlation check appears below. Outcome: reformulate the framework with an explicit operator variable. Phase 19 GovSim data already suggest operator effects matter.
3. NeurIPS SafeAI rejects workshop submission (~40% probability if submitted). Outcome: submit to ICML alignment track, next-cycle NeurIPS, or Apart Research Sprint. Not a permanent blocker.
4. I cannot complete within 6 months due to health or personal constraints (~15% probability). Outcome: documented deferral with transparent status update to funders. Partial deliverables published.
If this project fully fails (failure modes 1-3 above occurring simultaneously, <5% probability): the contribution is still nontrivial — placebo-controlled null evidence, empirical data on operator variance, and a fully documented open methodology for others to iterate on. The grant would not be "wasted" even in the worst case. That said, I assess the probability of at least partial deliverables (items 1-2) at ~85%.
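As referenced under failure mode 2, a minimal sketch of how cross-operator correlation could be checked, using pairwise Spearman rank correlations on blind scores; operator names, item counts, and score distributions are invented placeholders, not the Protocol E analysis plan:

```python
# Hypothetical check for "correlated results across operators": pairwise
# Spearman rank correlations between operators' blind scores on a shared
# item set. The synthetic scores below are independent, so rho lands near
# zero, illustrating the failure mode rather than the hoped-for outcome.
from itertools import combinations
from scipy.stats import spearmanr
import numpy as np

rng = np.random.default_rng(1)
items = 30  # shared scored items per operator (placeholder)
scores = {f"op{i}": rng.normal(0.5, 0.15, items) for i in range(1, 6)}

for a, b in combinations(scores, 2):
    rho, p = spearmanr(scores[a], scores[b])
    print(f"{a} vs {b}: rho = {rho:+.2f} (p = {p:.2f})")
# Consistently positive rho would support a protocol-general effect;
# near-zero rho across pairs would indicate operator-dependent variance.
```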
$0 from any grant program. No institutional funding. No alignment grant history.
All research to date funded from personal resources: approximately $100 USD total across API costs, domain/hosting, and miscellaneous. Primary personal income: pentest consulting (Russia-based clients).
This Manifund application is the first of three parallel submissions. Applications to LTFF and SFF (October 2026 round, via a fiscal sponsor setup) are planned in the coming weeks.